Emanuel Parzen
Texas A&M University

The workshop on "Density Estimation and Function Smoothing," held
at Texas A&M University on March 11-13, 1982 under the sponsorship of
NASA, provided the occasion for a cross-section of mathematical
scientists involved in this field to meet for an intensive sharing of
results and viewpoints. All participants regarded the workshop as an
unusually warm, stimulating, and productive experience. The papers
collected in this volume provide written versions of the papers
presented, enabling a wide audience to enjoy the excitement experienced
at the workshop in being able to learn about the diverse research
directions that constitute the current state of the art in the
statistical discipline of density estimation and function smoothing.

One conclusion to be drawn from these papers is that solutions to
problems of density estimation and function smoothing involve aspects
of theoretical and applied mathematics, probability and statistics,
numerical analysis and computer science, information theory and
approximation theory, as well as scientific fields such as meteorology
and remote sensing. I believe this field of mathematical science merits
a name of its own, and I propose "statistical functional inference." I
believe that statistical model identification techniques are required to
develop and implement workable practical solutions to problems in
density estimation and function smoothing. There is reason to believe
that the techniques being developed by the workshop participants will
ultimately prove to be of great value in accomplishing the objectives of
NASA.


The papers collected here are extremely rich in content, and it is
impossible to convey their importance in a few summary sentences.
Nevertheless, to help the reader obtain an overview of each paper I have
written a short description of each.

Devroye takes a critical look at mathematical results on the
convergence of estimators of a probability density $f$ on $R^d$ from a
random sample $X_1, \ldots, X_n$.

Geman  provides  insight  about  the  problem  of  choosing  a smoothing 
parameter  by  cross-validation. 

McClure discusses estimation of a planar convex region from
projections of counts of events which are Poisson distributed at
different rates inside and outside the region.

Geman  and  McClure  relate  kernel  type  density  estimators  to  maximum 
likelihood  density  estimators  calculated  by  the  method  of  sieves. 

O'Sullivan discusses how methods of regularized and generalized
cross-validation can be used to estimate the atmosphere's temperature,
moisture, and wind structure from a finite number of noisy measurements
by meteorological satellites on the intensity of upwelling radiation in
selected channel frequencies.

Parzen  presents  an  approach  to  statistical  data  science  based  on 
quantile  functions,  density-quantile  functions,  and  information  and  entropy 
measures.  He  outlines  a new  approach  to  density  estimation  based  on  using 
exponential  probability  densities  as  exact  and  approximate  models. 

Peters discusses, for a probability model of a finite mixture of
multivariate distributions, the asymptotic consistency, normality, and
efficiency of the maximum likelihood estimators of the parameters of
this model.


An important technique for estimating a smooth function $g(t)$, given
data values $x_i$, $i = 1, \ldots, n$, which are noisy measurements of
$Ag(t_i)$ for a known linear operator $A$, is to choose $g$ to minimize

$$\sum_{i=1}^{n} \{x_i - Ag(t_i)\}^2 + \lambda \int |g''(t)|^2 \, dt.$$

Rice and Rosenblatt examine this procedure in the cases of numerical
differentiation and deconvolution.

Schuster summarizes results reported in several papers by Schuster
and Gregory on their experience in applying non-parametric maximum
likelihood techniques of density estimation to judge the comparative
quality of various estimators.

Scott  summarizes  his  experience  in  comparing  the  effects  of  smoothing 
parameters  on  probability  density  estimators  for  univariate  and  bivariate 
data. 

Silverman introduces, and discusses the asymptotic behavior of, a
test statistic for hypotheses concerning the number of modes in a
probability density.

Thompson  introduces  a method  for  generating  random  vectors  from  the 
distribution  of  a random  vector  x which  is  based  on  a random  sample  of 
x without  estimating  the  underlying  density. 

Redner and Walker review the theory of estimation of parameters of
mixture density models, and discuss in detail iterative procedures for
numerical approximation of maximum likelihood estimates based on the EM
algorithm.

Yakowitz and Szidarovszky provide a comprehensive review of "kriging"
methods for fitting functions to spatial data.

Wendelberger  discusses  multidimensional  smoothing  splines,  the 
method  of  generalized  cross-validation,  and  applications  to  meteorology. 

Thursday, March 11:

8:15  Coffee  and  donuts 

8:30  Remote  Sensing  Fundamental  Research  Program:  An  Overview 
R. B. MacDonald, NASA/Johnson Space Center
L. F. Guseman, Jr., Texas A&M University

9:20  Topics  in  Global  Convergence  of  Density  Estimates 
Luc  Devroye,  McGill  University 

10:10  Coffee 

10:30  Smoothing  Splines:  Regression,  Derivatives,  and  Deconvolution 
John  Rice,  University  of  California  at  San  Diego 

11:20  On  Statistics  and  Density  Estimation 

Herbert  Robbins,  Columbia  University 

12:00  Lunch 

1:40  A Bootstrap  Approach  to  Bump  Hunting 

B.  W.  Silverman,  University  of  Bath 

2:30  Cross  Validation  for  Densities  and  Regressions 

Stu  Geman,  Brown  University 

3:20  Coffee 

3:40  Estimation  of  Planar  Sets  from  Poisson  Projections 
Donald  McClure,  Brown  University 

4:30  Quantiles, Parametric-Select Density Estimation, and Bi-Information
Density Estimators

Emanuel  Parzen,  Texas  A&M  University 
6:15  Assemble  in  Aggieland  Inn  Lobby  for  Cocktails  & Dinner 


Friday, March 12:

8:15  Coffee  and  donuts 

8:30  Considerations  in  Cross  Validation  Type  Density  Smoothing  with 
a Look  at  Some  Data 

Eugene  F.  Schuster,  University  of  Texas  at  El  Paso 

9:20  Nonparametric Regression and Kriging Methods for Spatial Data
Sidney  Yakowitz,  University  of  Arizona 

10:10  Coffee 

10:30  Multidimensional Smoothing Splines and Their Restriction to the Sphere
Jim Wendelberger, University of Wisconsin-Madison

11:20  Remote Sensing of Temperature Profiles in the Atmosphere
Finbarr O'Sullivan, University of Wisconsin-Madison

12:00  Lunch 

1:40  A Data Based Random Number Generator for a Multivariate Distribution
J. R. Thompson, Rice University

2:30  Review of Some Results in Bivariate Density Estimation
David Scott, Rice University

3:20  Coffee

3:40  Consistency and Other Large Sample Properties of Maximum Likelihood
Estimates of Mixture Parameters
B. Charles Peters, University of Houston

4:30  Mixture Densities, Maximum Likelihood, and the EM Algorithm

Homer  Walker,  University  of  Houston 

Luc Devroye
McGill  University 


SUMMARY. 

We take a critical look at the problem of estimating a density $f$ on
$R^d$ from a sample $X_1, \ldots, X_n$ of independent identically
distributed random vectors, and review some recent results in the field.
Among other things, we will qualify the following statements:

(i)  For any sequence of density estimates $f_n$, any arbitrary slow
rate of convergence to 0 is possible for $E(\int |f_n - f|)$.

(ii)  In theoretical comparisons of density estimates, one should use
$\int |f_n - f|$ and not $\int |f_n - f|^p$, $p > 1$.

(iii)  For most reasonable nonparametric density estimates, either we
have convergence of $\int |f_n - f|$ (and then the convergence is in the
strongest possible sense for all $f$), or we have no convergence (and
then we don't even have convergence in the weakest possible sense for a
single $f$). There is no intermediate situation.


* Research  of  the  author  was  supported  by  NSERC  Grant  A3455.  The  author  is  with  the 
School  of  Computer  Science,  McGill  University,  805  Sherbrooke  Street  West,  Montreal, 
Canada  H3A  2K6. 

1.  INTRODUCTION.

In this paper we discuss various issues related to the problem of
estimating a density on $R^d$ from a sample $X_1, \ldots, X_n$ of
independent identically distributed random vectors having density $f$,
such as: how should one judge the goodness of an estimate; is there an
optimal estimate; how good can estimates be for small $n$ and large $n$;
and does it pay to use sophisticated estimates? The discussion will be
supplemented with a selected survey of recent results in the field.


A density estimate is a sequence $f_1, f_2, \ldots, f_n, \ldots$ where
for each $n$,

$$f_n(x) = f_n(x, X_1, \ldots, X_n)$$

is a real-valued Borel measurable function of $x \in R^d$ and the data
$X_1, \ldots, X_n$.

A density estimate can be parametric or nonparametric, but this
distinction is not important in what follows. The prototype parametric
estimate is defined as follows for $d = 1$:

$$f_n \text{ is normal } (\hat\mu, \hat\sigma^2), \qquad \hat\mu = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat\mu)^2. \tag{1}$$

The most frequently used nonparametric estimate is the kernel estimate
(Rosenblatt (1956) and Parzen (1962)):

$$f_n(x) = \frac{1}{n} \sum_{i=1}^{n} h^{-d} K\!\left(\frac{X_i - x}{h}\right), \tag{2}$$

where $h > 0$ is a number depending upon $n$, and $K$ is a given density
(the kernel).

For bibliographies on density estimation, see Wegman (1972), Wertz
(1978), Wertz and Schneider (1979) and Bean and Tsokos (1980).
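The kernel estimate defined in this section can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `kernel_estimate`, the array layout (one observation per row), and the default uniform kernel on the cube $[-1/2, 1/2]^d$ are all assumptions made for the example.

```python
import numpy as np

def kernel_estimate(x, sample, h, kernel=None):
    """Evaluate f_n(x) = (1/n) * sum_i h**(-d) * K((X_i - x) / h).

    x      : array of shape (m, d), points at which to evaluate the estimate
    sample : array of shape (n, d), the data X_1, ..., X_n
    h      : positive smoothing number
    kernel : function mapping an array (..., d) to kernel values (...)
    """
    x = np.asarray(x, dtype=float)
    sample = np.asarray(sample, dtype=float)
    n, d = sample.shape
    if kernel is None:
        # Assumed default: the uniform density on the cube [-1/2, 1/2]^d.
        kernel = lambda u: np.all(np.abs(u) <= 0.5, axis=-1).astype(float)
    u = (sample[None, :, :] - x[:, None, :]) / h   # shape (m, n, d)
    return kernel(u).sum(axis=1) / (n * h ** d)
```

For one-dimensional data, pass column vectors, for example `sample.reshape(-1, 1)`. Because $K$ is a density, the returned function of $x$ is itself a density for every fixed sample.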

2.  MEASURES OF GOODNESS.


We would like to obtain a number that measures how close $f_n$ is to $f$
in order to carry out theoretical comparisons between estimates later
on. For a variety of reasons, but mostly for the sake of convenience,
researchers have proposed the criterion

$$\int (f_n - f)^2. \tag{3}$$

All integrals in this paper are with respect to Lebesgue measure ($dx$).
Note that (3) is a random variable, and that it is necessary to take its
expected value. In general, we can consider all integral measures of
goodness:

$$E\left(\int |f_n - f|^p\right), \qquad p \ge 1.$$

We will now argue that the only reasonable integral measure is the
measure with $p = 1$. Our argument is based on a couple of observations.


1.  Let $g$ be an estimate of $f$. If $X$ has density $f$, then $aX$ has
density $\frac{1}{a} f(\frac{x}{a})$, and this density should be
approximated by $\frac{1}{a} g(\frac{x}{a})$. But

$$\int \left| \frac{1}{a} f\!\left(\frac{x}{a}\right) - \frac{1}{a} g\!\left(\frac{x}{a}\right) \right|^p dx = a^{1-p} \int |f - g|^p.$$

Thus, the only $L_p$ measure that is independent of the scale is the
$L_1$ measure.

2.  By Minkowski's inequality we have

$$\left(\int |f - g|^p\right)^{1/p} \ge \left| \left(\int |f|^p\right)^{1/p} - \left(\int |g|^p\right)^{1/p} \right|,$$

where the lower bound is infinite if one of the terms is infinite and
the other one is finite. Thus, in any reasonable theory involving the
$L_p$ measure, we must assume first that $f \in L_p$. However, the only
space to which all densities belong without discrimination is $L_1$.




3.  If $f$ and $g$ are both densities, then for any set $B \subseteq R^d$,
the probabilities of $B$ defined by $f$ and $g$ respectively differ by
at most

$$\Delta = \sup_B \left| \int_B f - \int_B g \right|.$$

For example, if $\Delta$ is known to be less than $10^{-4}$, then two
independent samples of size $10^4$, one from $f$ and one from $g$, are
all but statistically indistinguishable. Thus, keeping $\Delta$ small
has a true practical impact in the area of simulation. But clearly,

$$\Delta = \frac{1}{2} \int |f - g|.$$

No other measure has any connection with $\Delta$ in the sense that for
any $p > 1$ and any $f$, there exist sequences of densities $f_n$ and
$g_n$ such that

(i)  $\int |f_n - f| \to 0$, $\int |f_n - f|^p \to \infty$,

(ii)  $\int |g_n - f| \ge c > 0$, $\int |g_n - f|^p \to 0$.

4.  If $f$ and $g$ are normal densities with zero mean and variances
$\sigma^2$ and $\tau^2$, then $\int |f - g|$ depends only upon
$\sigma/\tau$, and tends to 0 if and only if $\sigma/\tau \to 1$.
However, for $p > 1$, $\int |f - g|^p$ can tend to $\infty$ even if
$\sigma/\tau$ tends to 1 (let $\sigma \to 0$ with $\tau/\sigma \to 1$
sufficiently slowly), and can tend to 0 even if $\sigma/\tau$ tends to
$\infty$ (let $\tau \to \infty$ and $\sigma/\tau \to \infty$).
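Observations 3 and 4 can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses two zero-mean normal densities, approximates integrals by Riemann sums on a fine grid, and takes the supremum over sets $B$ to be attained at $B = \{f > g\}$. The helper names are made up for this example.

```python
import numpy as np

def normal_density(x, sd):
    """Zero-mean normal density with standard deviation sd."""
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def l1_and_delta(f_vals, g_vals, dx):
    """Riemann-sum approximations of int|f - g| and of the gap on B = {f > g}."""
    diff = f_vals - g_vals
    l1 = np.sum(np.abs(diff)) * dx
    delta = np.sum(diff[diff > 0]) * dx
    return l1, delta

x = np.linspace(-40.0, 40.0, 800001)
dx = x[1] - x[0]

# Standard deviations 1 and 2: delta should equal half of the L1 distance.
l1, delta = l1_and_delta(normal_density(x, 1.0), normal_density(x, 2.0), dx)

# Standard deviations 3 and 6 have the same ratio, so the L1 distance should
# be unchanged, while the integrated squared difference is not scale free.
l1_scaled, _ = l1_and_delta(normal_density(x, 3.0), normal_density(x, 6.0), dx)
l2 = np.sum((normal_density(x, 1.0) - normal_density(x, 2.0)) ** 2) * dx
l2_scaled = np.sum((normal_density(x, 3.0) - normal_density(x, 6.0)) ** 2) * dx
```

Under these assumptions `delta` agrees with `0.5 * l1` and `l1_scaled` agrees with `l1` up to discretization error, whereas `l2_scaled` is smaller than `l2` by the factor $a^{1-p} = 1/3$ predicted by observation 1 with $a = 3$, $p = 2$.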

3.  NEGATIVE RESULTS.

Many a density estimate (such as the kernel estimate) has been
criticized for not performing well "for small sample sizes". Recent work
in the area of density estimation has been in the direction of improved
small sample performance and automatization of the estimate
(automatization of the kernel estimate means that the parameter $h$ is
chosen as a function of the data). For research in this direction, see
Deheuvels (1977a, b), Duin (1976), Scott et al. (1977), Silverman
(1978), Davis (1977), Scott et al. (1981), Wahba (1977, 1978), de
Montricher et al. (1975), Good and Gaskins (1980), Breiman et al.
(1977), Nadaraya (1974), and Devroye and Wagner (1980). Most
automatization schemes are so sophisticated that it is hard to prove
that $f_n$ converges to $f$ in any sense at all. In fact, many schemes
should be avoided altogether. For example, Schuster and Gregory (1981)
have shown that the cross-validation method for determining "h" in the
kernel estimate will not lead to a consistent estimate for most
densities $f$ with an infinite tail (such as the exponential density).
Consistent cross-validated density estimation is also discussed by Chow,
Geman and Wu (1981).

Even  if  an  estimate  is  known  to  be  consistent  for  all  densities  f,  its  small 
sample  and  large  sample  properties  may  be  terrible.  The  search  for  always  better 
estimates  is  doomed  to  be  frustrating.  In  part,  this  frustration  is  captured  in  the 
following  result. 

Theorem 1. (Devroye, 1981a):

For every density estimate, and every $p \ge 1$, and every sequence of
positive numbers $a_n$ tending to 0, there exists a density $f$ on $R^d$
such that

$$E\left(\int |f_n - f|^p\right) \ge a_n \quad \text{infinitely often.}$$

We can always find such an $f$ among the class of densities bounded by 2
and vanishing outside $[0,1]^d$. Moreover, for $p = 1$, the density $f$
in question can also be taken from the class of infinitely many times
continuously differentiable functions.

Thus, any kind of continuity condition alone, however strong, is not
sufficient for the study of the rate of convergence to 0 of
$E(\int |f_n - f|)$, regardless of the type of estimate that is used!
For such studies, it seems that one needs combinations of continuity and
tail conditions.

Theorem  1 is  in  the  spirit  of  a theorem  proved  by  Boyd  and  Steele  in  1979. 

Theorem 2. (Boyd and Steele, 1979)

For every density estimate, there exists a normal density $f$ with zero
mean such that

$$E\left(\int |f_n - f|^2\right) \ge c(f)/n \quad \text{infinitely often,}$$

where $c(f) > 0$ is a constant depending upon $f$ only.


In a sense, Theorem 2 gives us new information. Even if $f$ is known to
be normal with zero mean and unknown variance, it is impossible to find
an estimate with an $L_2$ rate of convergence that is better than $1/n$.
The theorem cannot be improved in the sense that the parametric estimate
(1) satisfies $E(\int |f_n - f|^2) \le c(f)/n$, all $n$ (Maniya, 1969).


Let us finally point out that several results that have received
widespread attention to date are practically vacuous. For example,
Rosenblatt (1971) has shown that the kernel estimate (2) satisfies

$$E\left(\int |f_n - f|^2\right) \sim \frac{a}{nh} + \frac{b h^4}{4}$$

as $n \to \infty$, $h \to 0$, when $K$ is bounded and symmetric,
$d = 1$, $\int x^2 K < \infty$, $a = \int K^2$,
$b = (\int x^2 K)^2 \int f''^2$, and $f \in \mathcal{F}$ = (all densities
on $R$ that are twice continuously differentiable and for which
$\int f^2 < \infty$ and $\int f''^2 < \infty$ and $f$ is bounded). Thus,
if we take $h = (a/(bn))^{1/5}$, then

$$E\left(\int |f_n - f|^2\right) \sim \frac{5}{4} \, \frac{a^{4/5} b^{1/5}}{n^{4/5}}. \tag{4}$$

Thus, there are densities $f$ in $\mathcal{F}$ for which (4) is valid
and for which at the same time, $E(\int |f_n - f|) \ge 1/\log\log\log n$
infinitely often (Theorem 1). But without guarantees for the performance
of $f_n$ in $L_1$, Rosenblatt's result loses credibility. Thus, the
choice $h = (a/(bn))^{1/5}$ for the kernel estimate, even if $a$ and $b$
were known, may not be "optimal" after all!
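The bandwidth $h = (a/(bn))^{1/5}$ and the constant $5/4$ follow from a one-line minimization. As a worked sketch, write $M(h)$ (a label introduced here for convenience) for the asymptotic expression $a/(nh) + b h^4/4$:

$$M(h) = \frac{a}{nh} + \frac{b h^4}{4}, \qquad M'(h) = -\frac{a}{n h^2} + b h^3 = 0 \iff h^5 = \frac{a}{bn}.$$

Substituting $h_* = (a/(bn))^{1/5}$ gives

$$\frac{a}{n h_*} = \frac{a^{4/5} b^{1/5}}{n^{4/5}}, \qquad \frac{b h_*^4}{4} = \frac{1}{4} \, \frac{a^{4/5} b^{1/5}}{n^{4/5}}, \qquad M(h_*) = \frac{5}{4} \, \frac{a^{4/5} b^{1/5}}{n^{4/5}}.$$

Since $M''(h) > 0$ for $h > 0$, this stationary point is the minimizer of $M$.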

4.  POSITIVE RESULTS.

A thorough  study  of  global  rates  of  convergence  for  density  estimates 
in  general  and  the  kernel  estimate  in  particular  was  carried  out  by 
Bretagnolle  and  Huber  (1979).  We  cite  one  of  their  results  that  is  closest 
to  what  we  need  in  the  present  discussion. 

Theorem 3. (Bretagnolle and Huber, 1979)

If $d = 1$ and $f \in \mathcal{F}_s$ = {all densities with compact
support, that are $s$ times differentiable ($s \ge 1$ is an integer)
such that $\int |f^{(s)}| < \infty$}, and if the kernel $K$ in (2)
satisfies: $\int K = 1$, $\int x^j K = 0$ ($0 < j < s$),
$\int |x|^s |K| < \infty$, $K$ has compact support, then a sequence
$h = h(n)$ can be found such that for the kernel estimate

$$\limsup_{n \to \infty} \; n^{\frac{s}{2s+1}} E\left(\int |f_n - f|\right) \le c(s) \left(\int |f^{(s)}|\right)^{\frac{1}{2s+1}} \left(\int \sqrt{f}\right)^{\frac{2s}{2s+1}}, \quad \text{some } c(s) > 0.$$

This does not contradict Theorem 1 because $\mathcal{F}_s$ combines a
continuity condition and a compactness condition. Unfortunately,
$\mathcal{F}_s$ does not include many common densities such as the
normal and exponential densities.


A second positive development is related to the observation that for
most reasonable nonparametric density estimates,
$E(\int |f_n - f|) \to 0$ for all densities $f$ on $R^d$. If we cannot
say much about rates of convergence, at least we are guaranteed that the
estimates are consistent. The first result of this type is due to
Abou-Jaoude (1976a, 1976b, 1976c), who studies the histogram estimates.
Here we consider a sequence of partitions $P_n$ of $R^d$, where
$P_n = \{A_{n1}, A_{n2}, \ldots\}$, and we denote the set $A_{ni}$ to
which $x$ belongs by $A_n(x)$. The histogram estimate is defined by

$$f_n(x) = \left(n \lambda(A_n(x))\right)^{-1} \sum_{i=1}^{n} I_{A_n(x)}(X_i), \tag{5}$$

where $I$ is the indicator function and $\lambda$ is Lebesgue measure.
Although Abou-Jaoude treats very general sorts of partitions, we will
only state his results for the most common partitions: $P_n$ consists of
all sets

$$\prod_{i=1}^{d} \left[a_i b_n, (a_i + 1) b_n\right), \tag{6}$$

where $a_1, \ldots, a_d$ can take all the integer values, and $b_n$ is a
sequence of positive numbers.
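The histogram estimate on the cubic partition with side $b_n$ can be sketched as follows. This is an illustrative implementation only: the function name `histogram_estimate` and the array layout are assumptions, and each cell is identified by the integer vector $(a_1, \ldots, a_d)$ obtained by flooring the coordinates divided by the side length.

```python
import numpy as np

def histogram_estimate(x, sample, b):
    """f_n(x) = (n * b**d)**(-1) * #{i : X_i lies in the cell A_n(x)}.

    x      : array of shape (m, d), evaluation points
    sample : array of shape (n, d), the data X_1, ..., X_n
    b      : positive side length b_n of the cubic cells
    """
    x = np.asarray(x, dtype=float)
    sample = np.asarray(sample, dtype=float)
    n, d = sample.shape
    cell_of_x = np.floor(x / b)            # integer vector (a_1, ..., a_d) per point
    cell_of_sample = np.floor(sample / b)
    # same_cell[j, i] is True when X_i falls in the cell containing x_j.
    same_cell = np.all(cell_of_x[:, None, :] == cell_of_sample[None, :, :], axis=-1)
    return same_cell.sum(axis=1) / (n * b ** d)
```

Each cell has Lebesgue measure $b^d$, which is why the count is divided by $n b^d$.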

Theorem 4. (Abou-Jaoude, 1976a, c)

For the histogram estimate defined by (5-6), the following conditions
are equivalent:

A.  $\int |f_n - f| \to 0$ in probability as $n \to \infty$, for all $f$.

B.  $\int |f_n - f| \to 0$ almost surely as $n \to \infty$, for all $f$.

C.  $\int |f_n - f| \to 0$ completely as $n \to \infty$, for all $f$.

D.  $\lim_{n \to \infty} b_n = 0$, $\lim_{n \to \infty} n b_n^d = \infty$.

(A sequence of random variables $X_n$ converges completely to 0 if for
all $\varepsilon > 0$, $\sum_n P(|X_n| \ge \varepsilon) < \infty$. Thus,
complete convergence implies almost sure convergence.)


For histogram estimates, all types of convergence are equivalent. The
convergence of the kernel estimate for all densities $f$ was first
observed by Devroye and Wagner (1979). Devroye (1981b) showed a strong
equivalence theorem for the kernel estimate:

Theorem 5. (Devroye, 1981b)

For the kernel estimate (2) with a compact support kernel $K \ge 0$
which integrates to 1, the following statements are equivalent:

A.  $\int |f_n - f| \to 0$ in probability as $n \to \infty$, for some $f$.

B.  $\int |f_n - f| \to 0$ almost surely as $n \to \infty$, for some $f$.

C.  $\int |f_n - f| \to 0$ completely as $n \to \infty$, for some $f$.

D.  $\int |f_n - f| \to 0$ completely as $n \to \infty$, for all $f$.

E.  $\lim_{n \to \infty} h = 0$, $\lim_{n \to \infty} n h^d = \infty$.

Furthermore, E implies D whenever $K$ is absolutely integrable and
$\int K = 1$.

The difference with Theorem 4 is that weak convergence (A) for one $f$
is enough to conclude E in Theorem 5, while weak convergence for all $f$
is needed to conclude D in Theorem 4. Thus, either we have convergence
in $L_1$ for kernel estimates (and then the convergence is in the
strongest possible sense, and for all $f$), or we have no convergence in
$L_1$ for kernel estimates (and then the estimate does not even converge
in the weakest sense for a single $f$).
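
I. Introduction

Virtually all nonparametric (infinite dimensional) problems
require the choice of a "smoothing parameter".

Example: $x_1, x_2, \ldots$ i.i.d. from a distribution with unknown
density "f". Consider the Parzen-Rosenblatt kernel estimator
with window width $1/\lambda$:

$$f_{n,\lambda}(x) = \frac{1}{n} \sum_{i=1}^{n} \lambda \, k(\lambda(x - x_i)),$$

where $k$ is a probability kernel, or the histogram with bin
width $1/\lambda$:

$$f_{n,\lambda}(x) = \frac{\lambda}{n} \, \#\left\{ x_i : x_i \in \left[\frac{k-1}{\lambda}, \frac{k}{\lambda}\right) \right\} \quad \text{for } x \in \left[\frac{k-1}{\lambda}, \frac{k}{\lambda}\right),$$

where in this second formula $k$ ranges over the integers indexing the
bins.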

In each case $\lambda$ serves as a smoothing parameter. It is well-known
that if $\lambda_n \uparrow \infty$ sufficiently slowly then
$f_{n,\lambda_n} \to f$ (e.g. almost surely in $L_1(R, B, dx)$).
Depending on the assumptions made, optimal rates can be specified for
$\lambda_n$, but these will always depend on the unknown density $f$.
How should $\lambda$ be chosen for a fixed, finite, sample? For moderate
sample sizes, both estimators are sensitive to the choice of $\lambda$.
This is the "smoothing problem". It has its analogue for virtually all
(non-Bayesian) nonparametric density estimators. For example, the
maximum penalized likelihood estimator requires the choice of a weight
to be given the penalty term. Orthogonal series estimators (for
densities or regressions) require that we specify the number of terms to
be used in a truncated series
expansion.  Splines  for  nonparametric  regression  typically
arise  from  solving  a least  squares  problem  with  penalty,  which 
may  be,  for  example,  the  integral  of  the  squared  second  derivative 
of  the  estimator.  As  with  penalized  maximum  likelihood,  the 
smoothing  parameter  here  is  the  weight  given  the  penalty  term. 

Some  estimators  of  finite  dimensional  parameters  also 
contain  unspecified  smoothing  parameters.  In  fact,  it  is 
sometimes  useful  to  introduce  a smoothing  parameter  into  an 
estimator  that  is  otherwise  completely  specified.  Consider, 
for  example,  the  linear  regression  problem: 

$$y_i = x_{i1}\beta_1 + \cdots + x_{ip}\beta_p + \varepsilon_i, \qquad 1 \le i \le n, \qquad \varepsilon_i \text{ iid } N(0, \sigma^2).$$

Or, in vector-matrix notation:

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I).$$

The least squares (maximum likelihood) estimator for $\beta$ is

$$\hat\beta = (X^T X)^{-1} X^T Y.$$

The ridge estimator for $\beta$ is

$$\hat\beta_\lambda = (X^T X + n\lambda I)^{-1} X^T Y, \qquad \lambda \ge 0.$$

Observe that $\hat\beta_0$ is the least squares estimator. The
introduction of $\lambda$ into the least squares estimator may be
motivated by any of the following considerations. (1) $\hat\beta_\lambda$
minimizes an expression of the form

$$\|Y - X\beta\|^2 + \gamma \|\beta\|^2.$$

Hence $\hat\beta_\lambda$ may be viewed as a penalized least squares
estimator, with penalty for large values of $\beta$. (2) When $X^T X$ is
"nearly singular" (poorly conditioned) $\hat\beta_0$ has large MSE due
to the fact that the inverse of $X^T X$ is involved in its derivation.
Adding $n\lambda I$ to $X^T X$ improves the conditioning and may be
expected to reduce MSE. (3) Perhaps the best justification for ridge
regression is the following easily demonstrated fact: for every $n$,
$\beta$, and $\sigma^2 > 0$, there exists a $\lambda > 0$ such that

$$E\|\beta - \hat\beta_\lambda\|^2 < E\|\beta - \hat\beta_0\|^2.$$

Unfortunately, the optimal $\lambda$ (in terms of MSE) depends on
$\beta$ and $\sigma^2$, so that we are again faced with a version of the
"smoothing problem".
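The ridge estimator $(X^T X + n\lambda I)^{-1} X^T Y$ is short enough to write out directly. The sketch below is a hedged illustration: the function name `ridge_estimate` and the simulated data (sample size, coefficients, noise level, random seed) are assumptions chosen only to exercise the formula.

```python
import numpy as np

def ridge_estimate(X, Y, lam):
    """Return (X^T X + n*lam*I)^{-1} X^T Y; lam = 0 gives least squares."""
    n, p = X.shape
    # Solving the linear system avoids forming the matrix inverse explicitly.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

# Hypothetical data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta = np.array([1.0, -2.0, 0.5])
Y = X @ beta + rng.normal(scale=0.5, size=50)

beta_ls = ridge_estimate(X, Y, 0.0)      # the least squares estimator
beta_ridge = ridge_estimate(X, Y, 0.1)   # shrunk toward zero by the penalty
```

With $\lambda > 0$ the solution has smaller Euclidean norm than the least squares solution, which reflects the penalty on large values of $\beta$ described in consideration (1).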

It is natural to attempt to use the data to guide the choice
of smoothing parameter. For each of the above examples many such
"data-driven" estimators have been proposed. Perhaps the most
widely applicable (certainly the most widely studied) data-driven
technique is cross-validation. Simulations show that
cross-validation can be a very effective means for choosing
smoothing parameters. However, the technique can fail badly,
and the conditions for success are not well understood. In
fact, almost nothing is known of the analytic properties of
cross-validated estimators. In collaboration with Drs. Y.S. Chow
and L.-D. Wu (previously visiting Brown University) and Aytul Erdal
(currently a graduate student at Brown University) I have been
attempting to establish some of the analytic properties of
cross-validated estimators. In the remainder of this talk I
will introduce, by example, the method of cross-validation,
and announce results which establish consistency for certain
cross-validated density estimators and consistency as well as
asymptotic normality for ridge regression.


II.  Cross-validation  for  choosing  smoothing  parameters. 

This  method  is  best  introduced  by  example. 

A.  Kernel  and  histogram. 

Recall: $x_1, x_2, \ldots, x_n$ is an i.i.d. sample from a
distribution with unknown density "f". $f_{n,\lambda}$ is either the
kernel with window width $1/\lambda$, or the histogram with bin width
$1/\lambda$. The problem is to choose $\lambda$ when faced with a fixed
and finite sample $x_1, x_2, \ldots, x_n$. The first step in applying
cross-validation is to form the estimator from the sample after first
deleting one of the observations:

$$f^{(i)}_{n,\lambda}(x) = \frac{1}{n-1} \sum_{j \ne i} \lambda \, k(\lambda(x - x_j)).$$

$f^{(i)}_{n,\lambda}(x_i)$ is a measure of the appropriateness of
$\lambda$ for smoothing the estimator. If $f^{(i)}_{n,\lambda}(x_i)$ is
large, we could say, loosely, that $f^{(i)}_{n,\lambda}(x)$ "anticipated
the observation $x_i$" (for fixed $\lambda$, $f^{(i)}_{n,\lambda}(x)$ is
formed independent of $x_i$). If $f^{(i)}_{n,\lambda}(x_i)$ is small,
then $x_i$ was measured as "unlikely", evidence that $\lambda$ does not
properly smooth the estimator. Through this procedure, applied $n$
times, we arrive at a likelihood-like expression:

$$L_\lambda = \prod_{i=1}^{n} f^{(i)}_{n,\lambda}(x_i).$$

We now choose $\lambda = \lambda_n$ to maximize $L_\lambda$. The
cross-validated estimator (due to Habbema et al. (5) and, independently,
Duin (3)) is $f_{n,\lambda_n}$. Simulations strongly support the use of
this

technique  for  certain  combinations  of  the  kernel  and  target 
density.  However,  the  method  can  fail,  and  in  surprisingly 
innocent  looking  situations.  For  example,  Schuster  and 
Gregory  (6)  have  shown  that  the  cross-validated  kernel,  using 
compact  kernel,  is  not  consistent  for  the  exponential  density. 

With  this,  as  with  all  cross-validated  estimators,  very  little 
is  known  analytically.  In  fact,  with  the  exception  of  the  results 
mentioned  below  for  kernels  and  histograms,  conditions  for  the 
consistency  of  cross-validated  density  estimators  are  unknown. 
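The leave-one-out likelihood for the kernel estimator can be sketched as below. This is a minimal illustration under stated assumptions: the data are one-dimensional, $\lambda$ is searched over a finite user-supplied grid rather than maximized exactly, the logarithm of $L_\lambda$ is maximized in place of $L_\lambda$ itself (the maximizer is the same), and the Gaussian kernel is just one convenient choice of probability kernel. All function names are made up for the example.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as an assumed probability kernel k."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def loo_log_likelihood(sample, lam, kernel):
    """log L_lambda = sum_i log f^{(i)}_{n,lambda}(x_i) for a 1-d sample."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    k_vals = lam * kernel(lam * (x[:, None] - x[None, :]))   # entry (i, j)
    np.fill_diagonal(k_vals, 0.0)          # delete observation i from its own estimate
    loo = k_vals.sum(axis=1) / (n - 1)     # f^{(i)}_{n,lambda}(x_i) for each i
    with np.errstate(divide="ignore"):
        return np.sum(np.log(loo))         # -inf when some observation gets density 0

def cross_validated_lambda(sample, lam_grid, kernel):
    """Pick the grid value of lambda that maximizes the leave-one-out likelihood."""
    scores = [loo_log_likelihood(sample, lam, kernel) for lam in lam_grid]
    return lam_grid[int(np.argmax(scores))]
```

With a compact kernel a single isolated observation can drive $L_\lambda$ to zero for large $\lambda$, which is one way to see why tail behavior matters for this criterion.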

B.  Ridge regression

Recall that the ridge estimator for $\beta$ in the model

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I)$$

is

$$\hat\beta_\lambda = (X^T X + n\lambda I)^{-1} X^T Y, \qquad \lambda \ge 0.$$

Define $\hat\beta^{(i)}_\lambda$ to be the ridge estimator obtained by
deleting (ignoring) the i'th observation. The squared error in
predicting the i'th observation:

$$\left(y_i - \sum_{j=1}^{p} x_{ij} \hat\beta^{(i)}_{\lambda,j}\right)^2$$

measures the appropriateness of $\lambda$ as a smoothing parameter.
Define

$$L_\lambda = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \sum_{j=1}^{p} x_{ij} \hat\beta^{(i)}_{\lambda,j}\right)^2$$

and choose $\lambda = \lambda_n$ to minimize $L_\lambda$. The
cross-validated ridge estimator (due to Allen (1)) is
$\hat\beta_{\lambda_n}$. Our simulations, and those of others, indicate
that $\hat\beta_{\lambda_n}$ is an extremely good estimator for $\beta$,
especially when $X^T X$ is nearly singular or $\sigma$ is large.
Although they may exist, we have not found any situations in which the
mean squared error of the cross-validated ridge regression estimator
exceeds that of the ordinary least-squares estimator. Often, the ridge
estimator reduces the MSE of least squares by 50 or more percent.
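Ordinary (leave-one-out) cross-validation for the ridge parameter can be sketched by brute force. This is an illustration, not the author's procedure: the function names are hypothetical, $\lambda$ is chosen from a finite grid, and the penalty used when one observation is deleted is taken to be $(n-1)\lambda I$, which is an assumption since the text does not say how the factor $n$ is treated after deletion.

```python
import numpy as np

def ridge_estimate(X, Y, lam):
    """Return (X^T X + n*lam*I)^{-1} X^T Y, with n the number of rows of X."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

def cv_score(X, Y, lam):
    """L_lambda = (1/n) * sum_i (y_i - sum_j x_ij * beta^{(i)}_{lambda,j})**2."""
    n = X.shape[0]
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                          # delete the i'th observation
        beta_i = ridge_estimate(X[keep], Y[keep], lam)    # assumed penalty (n-1)*lam
        errors[i] = (Y[i] - X[i] @ beta_i) ** 2           # squared prediction error
    return errors.mean()

def cross_validated_ridge_lambda(X, Y, lam_grid):
    """Pick the grid value of lambda that minimizes the cross-validation score."""
    scores = [cv_score(X, Y, lam) for lam in lam_grid]
    return lam_grid[int(np.argmin(scores))]
```

The cross-validated ridge estimator is then `ridge_estimate(X, Y, lam_n)` with `lam_n` the selected value. Refitting $n$ times per candidate $\lambda$ is the simplest approach and is adequate for small problems.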

There  is  a closely  related  estimator,  due  to  Golub,  Heath, 
and  Wahba  (4),  called  the  "generalized  cross-validation"  (GCV) 
ridge  regressor.  The  GCV  ridge  regressor  is  computed  by  first 
rotating  the  coordinate  system  and  then  deriving  the  ordinary 
cross-validation  estimator.  Simulations  demonstrate  the  GCV 
generally  performs  somewhat  better  than  ordinary  cross-validation, 
and  GCV  proves  to  be  more  mathematically  tractable.  Although 
the  above-mentioned  analytic  results  are  for  GCV,  I will  not 
formally  define  the  GCV  estimator  since  this  would  require  that 
I introduce  somewhat  involved  notation.  Suffice  it  to  say  that 
GCV  is  ordinary  cross-validation  in  a rotated  coordinate  system. 

I should  emphasize  that  cross-validation  has  its  version  for 
all  of  the  estimators  mentioned  earlier,  each  of  which  requires 
the  choice  of  a smoothing  parameter  to  be  fully  defined.  Quite 
generally  simulations  support  its  good  potential,  and  quite 
generally  there  are  no  theoretical  results  available  about  the 
cross-validated  estimator.  Thus  questions  of  distribution, 
efficiency,  robustness,  and  even  consistency  are  almost  completely 
unanswered. 

III.  Analytic Results

A.  Why  should  cross-validation  work? 

Before stating some analytic results about cross-validated
estimators, let me outline a heuristic argument in
favor of the technique in the ridge regression context. This
argument has its analogue for most cross-validated estimators,
whether the target parameter is a density or a regression.

In some cases it can be made into a proof of consistency (as
it can for cross-validated ridge regression), but in nonparametric
problems it appears that one must take a different approach.
Nevertheless, the motivation is similar for nonparametric as
well as parametric problems.

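The cross-validated ridge regressor is

$$\hat\beta_\lambda = (X^T X + n\lambda I)^{-1} X^T Y ,$$

where $\lambda$ is chosen to minimize

$$\frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \sum_j x_{ij}\, \hat\beta^{(i)}_{\lambda j} \Big)^2 . \qquad (*)$$

Although $\hat\beta^{(i)}_\lambda$ depends implicitly on $Y$, it is reasonable to expect
that a version of the law of large numbers will be in force
uniformly in $\hat\beta^{(i)}_\lambda$. This leads us to expect that for large $n$ (*)
is close to

$$\frac{1}{n} E_Y \sum_{i=1}^{n} \Big( y_i - \sum_j x_{ij}\, \hat\beta^{(i)}_{\lambda j} \Big)^2 ,$$

where "$E_Y$" means integration with respect to explicit appearances
of the components of $Y$, treating $\hat\beta^{(i)}_\lambda$ as constant. It is also
reasonable to expect that $\hat\beta^{(i)}_\lambda$ will differ very little from $\hat\beta_\lambda$,
especially when $n$ is large; thus we choose $\lambda_n$ to minimize
an expression which we might expect, for large $n$, to be close to

$$\frac{1}{n} E_Y \sum_{i=1}^{n} \Big( y_i - \sum_j x_{ij}\, \hat\beta_{\lambda j} \Big)^2
= \frac{1}{n} E_Y \| Y - X\hat\beta_\lambda \|^2
= \frac{1}{n} \| X\beta - X\hat\beta_\lambda \|^2 + \sigma^2 .$$

The conclusion is that the cross-validated estimator attempts
to minimize the positive definite quadratic form

$$(\beta - \hat\beta_\lambda)^T \, \frac{X^T X}{n} \, (\beta - \hat\beta_\lambda). \qquad (**)$$

Since

$$(\beta - \hat\beta_0)^T \, \frac{X^T X}{n} \, (\beta - \hat\beta_0) \to 0$$

(recall that $\hat\beta_0$ is the least squares estimator), we expect that
(**) will also converge to 0, and at least as fast.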
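
The leave-one-out criterion (*) can be sketched numerically. The example below is only an illustration under simplifying assumptions that are not in the text: a single predictor (so the matrix inverse becomes a division), a finite grid of candidate values of lambda, and a penalty n*lambda in which n is the number of cases actually used in each fit. The function names are hypothetical.

```python
def ridge_beta(xs, ys, lam):
    """One-predictor ridge estimate: sum(x*y) / (sum(x^2) + n*lam)."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


def cv_score(xs, ys, lam):
    """Criterion (*): average squared error when case i is predicted
    from a ridge fit computed without case i."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        xs_i = xs[:i] + xs[i + 1:]
        ys_i = ys[:i] + ys[i + 1:]
        beta_i = ridge_beta(xs_i, ys_i, lam)  # fit with case i deleted
        total += (ys[i] - xs[i] * beta_i) ** 2
    return total / n


def choose_lambda(xs, ys, grid):
    """Pick the grid value of lambda with the smallest criterion (*)."""
    return min(grid, key=lambda lam: cv_score(xs, ys, lam))
```

With noise-free data such as xs = [1, 2, 3], ys = [2, 4, 6] the criterion vanishes at lambda = 0, so the grid search returns 0; with noisy data a positive value is typically selected.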

B.  Ridge  regression 

Here, loosely stated, is what we know about the analytic
properties of the cross-validated ridge estimator:

THEOREM (with Aytul Erdal). If $\hat\beta_{\lambda_n}$ is the GCV ridge regressor
then

$$\| \hat\beta_{\lambda_n} - \beta \| \to 0 \quad \text{a.s.}$$

and

$$(X^T X)^{1/2} (\hat\beta_{\lambda_n} - \beta) \sim N(0, \sigma^2 I).$$


Observe  that  for  least  squares  the  distribution  of 
1. Summary

This  report  summarizes  ongoing  work  concerned  with  the 
reconstruction  of  planar  sets  that  can  be  only  partially  observed. 
Details  of  the  problem  formulations  and  of  the  results  reported 
here are being incorporated in a report describing a broader
class  of  problems,  specifically  the  estimation  of  an  intensity 
function  of  a planar  Poisson  process  based  on  observations  of 
stochastically  independent  fixed-angle  projections  of  the  process. 

First,  the  set-estimation  problem  is  formulated  and  connected 
to  reconstruction  methods  of  emission  computed  tomography.  Then 
the  inference  problem  per  se  will  be  isolated  and  approached  by 
traditional  estimation  methods. 

I shall focus on the special case of estimating an unknown
planar convex body $K_2$ that is a subset of a known convex body $K_1$.
Poisson events occur with an intensity $\lambda(x,y)$ that is spatially
inhomogeneous (and temporally homogeneous) within the larger
set $K_1$; we assume for our prototypal problem that $\lambda(x,y) = \lambda_2$
within $K_2$ and $\lambda(x,y) = \lambda_1 < \lambda_2$ within $K_1 - K_2$. The Poisson events
are projected on a line with fixed arbitrary orientation $\theta$
relative to the horizontal axis, and only the projected points
are observable.

The underlying model for generation of the projected point
process implies that its univariate intensity function $u_\theta$ is a
superposition of the "shadows" of $K_1$ and $K_2$. In particular,

$$u_\theta(\xi) = \lambda_1 w_1(\xi) + (\lambda_2 - \lambda_1) w_2(\xi) \qquad (1)$$

where (i) $w_1(\xi)$ is the known width of $K_1$ in direction $\theta + \pi/2$
and at location $\xi$ along the line of projection, (ii) $w_2(\xi)$ is the unknown
width function of $K_2$, and (iii) $\lambda_1$ and $\lambda_2$ are the unknown
values of $\lambda(x,y)$ within $K_1 - K_2$ and $K_2$, respectively. When $K_1$
and $K_2$ are convex then $w_1$ and $w_2$ are unimodal and analogies with
familiar nonparametric inference problems can be drawn.

The problem that is solved in this report is the
characterization of the maximum likelihood estimates of $\lambda_1$ and of
$u = (\lambda_2 - \lambda_1) w_2$, under the constraints on the structure of $u$ that
follow from convexity of $K_2$. The characterization is patterned
after ones that are familiar in the context of isotonic
estimation and regression. Specifically, the m.l.e. $u^*$ of $u$
attains a maximum value on a nondegenerate interval $[\xi_0, \xi_1]$.
To the left of $\xi_0$ (and to the right of $\xi_1$), $u^*$ is the slope of
the greatest convex minorant of a modified counting function
for the univariate point process.
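
For readers unfamiliar with the isotonic machinery, the slopes of a greatest convex minorant can be computed by the pool-adjacent-violators idea. The sketch below is generic: it takes any sequence of points with increasing abscissae, since the "modified counting function" of the report is not specified in this summary. The function name is hypothetical.

```python
def gcm_slopes(xs, ys):
    """Slopes of the greatest convex minorant of the points (xs[k], ys[k]).

    xs must be strictly increasing. One slope is returned per interval
    [xs[k-1], xs[k]]; the result is nondecreasing (pool adjacent violators).
    """
    blocks = []  # each block holds [total rise, total run, intervals pooled]
    for k in range(1, len(xs)):
        blocks.append([ys[k] - ys[k - 1], xs[k] - xs[k - 1], 1])
        # Pool while the previous block is steeper than the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            rise, run, count = blocks.pop()
            blocks[-1][0] += rise
            blocks[-1][1] += run
            blocks[-1][2] += count
    slopes = []
    for rise, run, count in blocks:
        slopes.extend([rise / run] * count)
    return slopes
```

For the points (0,0), (1,2), (2,1), (3,3) the middle point (1,2) lies above the minorant, so the first two intervals are pooled into a common slope of 0.5, followed by a slope of 2.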

The  characterization  of  u*  is  finite-dimensional  and  its 
computation  is  feasible.  The  intrinsic  complexity  of  the 
computation  of  u*  is  discussed  and  an  implemented  algorithm  is 
described.  Finally,  a simulation  example  illustrates  the 
performance  of  the  estimator  u*  and  the  use  of  u*  to  reconstruct 
the boundary of $K_2$.
1. Introduction

As an instance of Grenander's method of sieves [2] for
adapting the maximum-likelihood approach to settings where the
target parameter is infinite dimensional, we have considered
density functions of the form

$$f(x) = \int_{-\infty}^{\infty} \frac{1}{\sigma}\, \phi((x-y)/\sigma)\, G(dy) = (\phi_\sigma * G)(x). \qquad (1)$$

Here $G$ is an arbitrary cdf and $\phi$ is the standard normal density
function. In this note, we shall derive a characterization of
the cdfs $G^*$ that solve the maximum-likelihood equation

$$\mathcal{L}(G^*) = \max_G \mathcal{L}(G) \qquad (2)$$

where $\mathcal{L}(G)$ is the likelihood function

$$\mathcal{L}(G) = \prod_{i=1}^{n} f(x_i) \qquad (3)$$

determined by a random sample $x_1, x_2, \ldots, x_n$ from an unknown
population density $f_0$.

Geman and Hwang [1] have described the connection between
this optimization problem and nonparametric maximum-likelihood
estimation. In brief, if we specify a sequence $\{\sigma_m\}_{m=1}^{\infty}$ of
positive values with $\sigma_m \downarrow 0$ as $m \to \infty$, then the sequence of sets

$$S_m = \{ f : f = \phi_{\sigma_m} * G, \; G \text{ an arbitrary cdf} \}$$

defines a sieve of subsets of $L_1$, the so-called convolution
sieve. The method of sieves (i) fixes an index $m$, depending
on sample size $n$ and on the sequence $\{\sigma_m\}$, (ii) seeks the
solution $G_m^*$ of (2) determined by the sample $\{x_i\}_{i=1}^{n}$ and $\sigma_m$,
and (iii) forms the estimator $f_m^* = \phi_{\sigma_m} * G_m^*$.

The  familiar  Parzen-Rosenblatt  kernel  estimator  fits 
within  this  framework.  The  kernel  estimator  prescribes  G to 
be  the  empirical  cdf.  One  motivation  for  introducing  the 
convolution  sieve  is  to  study  the  relationship  between  the 
kernel  estimator  and  ones  derived  through  the  principle  of 
maximum  likelihood. 

Our characterization theorem for $G^*$ exhibits a rather
close relationship between $f_m^*$ and the kernel estimator based
on the Gaussian kernel. We shall show that the solution $G^*$
of (2) is a discrete cdf and that it contains no more than $n$
points in its support. Thus, the estimator $f_m^*$ obtained from
the method of sieves admits a representation of the form

$$f_m^*(x) = \sum_{j=1}^{q} p_j\, \frac{1}{\sigma_m}\, \phi\Big( \frac{x - y_j}{\sigma_m} \Big),$$

analogous to a familiar form of the kernel estimator. In
contrast to the kernel estimator, the support
$\{y_j\}_{j=1}^{q}$ of $G^*$ does not coincide with the sample $\{x_i\}_{i=1}^{n}$ and, in
general, the weights $\{p_j\}$ will not be identically equal to
$n^{-1}$. Computational experiments with closely related sieves
strongly indicate that the number $q$ of points in the support
of $G^*$ will typically be much smaller than sample size $n$.
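
The relation between the two estimators is easy to see in code. The sketch below evaluates a density of the form (1) when $G$ is discrete, and recovers the Gaussian kernel estimator as the special case in which $G$ is the empirical cdf. It does not compute the maximum-likelihood $G^*$; the function names are hypothetical.

```python
import math


def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def mixture_density(x, support, weights, sigma):
    """Density (1) for a discrete G placing mass weights[j] at support[j]."""
    return sum(p * phi((x - y) / sigma) / sigma
               for y, p in zip(support, weights))


def kernel_estimate(x, sample, sigma):
    """Gaussian kernel estimator: G is the empirical cdf of the sample,
    so the support is the sample and every weight equals 1/n."""
    n = len(sample)
    return mixture_density(x, sample, [1.0 / n] * n, sigma)
```

A sieve estimator with $q$ support points would be evaluated by the same `mixture_density` call, only with a shorter support list and unequal weights.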


2. Characterization Theorem

Theorem. Let $x_1, x_2, \ldots, x_n$ be a random sample from a population
with density $f_0$. Let $\sigma > 0$ and consider estimators $f$ of $f_0$
defined by (1).

(i) There exists a solution $G^*$ of the maximum-likelihood
problem (2)-(3).

(ii) If $G^*$ satisfies (2), then $G^*$ is a discrete cdf with
finite support. Denote $\mathrm{supp}(G^*) = \{s_j\}_{j=1}^{q}$. Then $q \le n$.

(iii) If $x_{(1)} = \min(\{x_i\}_{i=1}^{n}) < \max(\{x_i\}_{i=1}^{n}) = x_{(n)}$,
then $x_{(1)} < \min(\{s_j\}_{j=1}^{q})$ and $\max(\{s_j\}_{j=1}^{q}) < x_{(n)}$.

Proof: We may assume, for convenience and without loss of
generality, that $\sigma = 1$. The sample values can be rescaled,
setting $x_i' = x_i/\sigma$, if $\sigma \ne 1$.

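The maximum of $\mathcal{L}(G)$, if it exists, will be attained by a
cdf with support in $[x_{(1)}, x_{(n)}]$. To see this, consider an
arbitrary right-continuous cdf $G$ and define $G_0$ in terms of $G$ by

$$G_0((-\infty, x]) = \begin{cases} 0 & \text{for } x < x_{(1)}, \\ G((-\infty, x]) & \text{for } x_{(1)} \le x < x_{(n)}, \\ 1 & \text{for } x_{(n)} \le x. \end{cases}$$

$G_0$ is designed so that $G_0(\{x_{(1)}\}) = G((-\infty, x_{(1)}])$ and
$G_0(\{x_{(n)}\}) = G([x_{(n)}, \infty))$. Since $\phi$ is monotone on the separate
intervals $(-\infty, 0]$ and $[0, \infty)$, we have

$$\phi(x_i - x_{(n)})\, G_0(\{x_{(n)}\}) \ge \int_{x_{(n)}-0}^{\infty} \phi(x_i - y)\, G(dy)$$

and

$$\phi(x_i - x_{(1)})\, G_0(\{x_{(1)}\}) \ge \int_{-\infty}^{x_{(1)}+0} \phi(x_i - y)\, G(dy).$$

Consequently $(\phi * G_0)(x) \ge (\phi * G)(x)$ for all $x$ in $[x_{(1)}, x_{(n)}]$ and
hence $\mathcal{L}(G_0) \ge \mathcal{L}(G)$.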

The existence of a solution $G^*$ of (2) follows from (i)
the compactness of the (tight) family of cdfs having support
in $[x_{(1)}, x_{(n)}]$ and (ii) the observation that $\mathcal{L}(G)$ is a
bounded and continuous functional on this set of cdfs, i.e.
continuous with respect to the topology of weak convergence.

Let $G^*$ be a solution of (2) and set $f^* = (\phi * G^*)$. A
variational argument characterizes the points in the support of
$G^*$ as roots of a transcendental equation. Let $s$ be an arbitrary
point in the support of $G^*$. For any $\epsilon > 0$ and for any $z$, define
a measure $H_{s,\epsilon,z}$ by

$$H_{s,\epsilon,z}(B) = G^*((s-\epsilon, s+\epsilon] \cap (B - z)).$$

$H_{s,\epsilon,z}$ is a rigid shift through distance $z$ of $G^*$ restricted to
$(s-\epsilon, s+\epsilon]$. Define $G_{s,\epsilon} = G^* - H_{s,\epsilon,0}$. Then $G_{s,\epsilon} + H_{s,\epsilon,z}$ is a
cdf for any $z$, and it may be regarded as a local perturbation
near $s$ of $G^*$.

Let $f_{s,\epsilon,z} = \phi * [G_{s,\epsilon} + H_{s,\epsilon,z}]$ and observe that $f^* = f_{s,\epsilon,0}$.
Since $\prod_i f^*(x_i)$ is maximal, we have

$$0 = \frac{d}{dz} \sum_{i=1}^{n} \log f_{s,\epsilon,z}(x_i) \Big|_{z=0}.$$

Evaluation of the derivative gives

$$0 = \sum_{i=1}^{n} \frac{1}{f^*(x_i)} \frac{d}{dz} \Big[ \int_{s-\epsilon}^{s+\epsilon} \phi(x_i - y - z)\, G^*(dy) \Big]_{z=0}
= \sum_{i=1}^{n} \frac{1}{f^*(x_i)} \int_{s-\epsilon}^{s+\epsilon} (x_i - y)\, \phi(x_i - y)\, G^*(dy).$$

Dividing this expression by $G^*((s-\epsilon, s+\epsilon])$ and letting $\epsilon \to 0$
yields

$$\sum_{i=1}^{n} \frac{(x_i - s)}{f^*(x_i)}\, \phi(x_i - s) = 0,$$

for any $s$ in the support of $G^*$.

Now consider the function

$$T(y) = \sum_{i=1}^{n} \frac{(x_i - y)}{f^*(x_i)}\, \phi(x_i - y).$$

The support of $G^*$ is a subset of the set of roots of $T$.
Properties of this set follow from the connection of $T$ with
an extended Tchebycheff system. We can re-express $T$ as

$$T(y) = \frac{e^{-y^2/2}}{\sqrt{2\pi}} \sum_{i=1}^{n} \frac{1}{f^*(x_i)} \big[ x_i e^{-x_i^2/2} e^{x_i y} - e^{-x_i^2/2}\, y\, e^{x_i y} \big]
= \frac{e^{-y^2/2}}{\sqrt{2\pi}} \Big\{ \sum_{i=1}^{n} \big( a_i e^{x_i y} + b_i\, y\, e^{x_i y} \big) \Big\}.$$

The expression in braces is a simple linear combination of the
$2n$ functions $\{ e^{x_i y}, \, y e^{x_i y} \}_{i=1}^{n}$. When the $x_i$'s are distinct, this
set is an extended Tchebycheff system of order $2n$. (And of
course if $\{x_i\}_{i=1}^{n}$ is a random sample from population density
$f_0$, then the $x_i$'s are distinct w.p.1. If the $x_i$'s were not
distinct, we could reduce the order of the system accordingly
to express $T(y)$ in terms of an extended Tchebycheff system with
fewer than $2n$ elements.) The Tchebycheff property implies:

(i) $Z^0 = \{y : T(y) = 0\}$ has at most $2n-1$ elements, and
(ii) $Z^{+-} = \{y : T(y) = 0, \; T'(y) \le 0\}$ has at most $n$ elements
(Karlin and Studden [3]).

Since the support of $G^*$ is contained in $Z^0$, $G^*$ is discrete with
at most $2n-1$ jumps.

In order to show that $G^*$ has at most $n$ jumps, it suffices
to show that the support of $G^*$ is actually contained in $Z^{+-}$,
i.e. that $T'(s) \le 0$ for any $s$ in the support of $G^*$. For $f^*$,
we can now write

$$f^*(x) = \sum_{j=1}^{q} p_j\, \phi(x - s_j)$$

where $\{s_j\}_{j=1}^{q}$ is the support of $G^*$, $q \le 2n-1$, $p_j > 0$, and
$\sum_j p_j = 1$. Set $s = s_\ell$, for fixed $\ell$ between 1 and $q$. Let $\epsilon > 0$
and define a perturbation $f_\epsilon$ of $f^*$ by

$$f_\epsilon(x) = \sum_{j \ne \ell} p_j\, \phi(x - s_j) + \frac{p_\ell}{2}\, \phi(x - s + \epsilon) + \frac{p_\ell}{2}\, \phi(x - s - \epsilon).$$

The density $f_\epsilon$ admits a representation of the form (1) and
$f^* = f_\epsilon$ at $\epsilon = 0$. Since $\prod_i f^*(x_i)$ is maximal,

$$\frac{d^2}{d\epsilon^2} \sum_{i=1}^{n} \log f_\epsilon(x_i) \Big|_{\epsilon=0} \le 0.$$

Straightforward calculation yields

$$\frac{d^2}{d\epsilon^2} \sum_{i=1}^{n} \log f_\epsilon(x_i) \Big|_{\epsilon=0} = p_\ell\, T'(s)$$

and hence, as claimed, $T'(s) \le 0$.

Finally, to confirm the last statement in the theorem,
observe that if $s \le x_{(1)}$ for some $s$ in the support of $G^*$, then
$\phi(x_i - s)$ is strictly increasing for sufficiently small increases
in $s$ and for all $x_i$, except perhaps $x_{(1)}$. Further,
$\frac{d}{ds}\, \phi(x_{(1)} - s) \ge 0$ as long as $s \le x_{(1)}$. Hence $\prod_i f^*(x_i)$ is a
strictly increasing function of $s$, contradicting the
maximum-likelihood property of $G^*$ and $f^*$. The same reasoning precludes
$s \ge x_{(n)}$. □


3. Concluding Remarks

The characterization theorem was announced in the paper
by Geman and Hwang [1], where consistency questions for $f^*$ are
analyzed. The consistency results guarantee that $f^* \to f_0$
in $L_1$-norm, with probability one, provided that $\sigma \to 0$
sufficiently slowly as sample size $n \to \infty$.

H. Robbins recently restimulated interest in the
maximum-likelihood problem per se during his lecture at the NASA
Workshop on Density Estimation and Function Smoothing at
Texas A&M University, March 11-13, 1982. Professor Robbins
recalled his 1950 formulation of the maximum-likelihood
problem (1)-(3) in [4] wherein connections are made with
statistical decision problems.
1. Introduction

Remote sensing of the atmosphere is a rapidly developing science.
Today's meteorological satellites such as those in the TIROS-N series have
high resolution instruments on board which measure the intensity of
upwelling radiation in selected channel frequencies. A description of the
data retrieved by the radiometers on the TIROS-N type satellites can be
found in [7]. From these data it is possible to obtain information on the
atmosphere's temperature, moisture and wind structure. One of the goals of
the current Satellite Meteorology program is to improve the quality of
atmospheric information obtained from satellite soundings to a point where
it can be used for weather forecasting purposes. A major challenge in this
direction is to develop refined numerical and statistical methods for
inverting the equations of radiative transfer given a finite number of
noisy measurements.

For a non-scattering atmosphere in local thermodynamic equilibrium the
radiative transfer equations (RTE's) describe how the satellite upwelling
radiance measurements relate to the underlying temperature distribution $T$:

$$\mathcal{R}_\nu(T) = B_\nu[T(p_0)]\, \tau_\nu(p_0) - \int_0^{p_0} B_\nu[T(p)]\, \tau_\nu'(p)\, dp \qquad (1.1)$$

where $p_0$ is the surface pressure, $\tau_\nu(p)$ is the transmittance of the
atmosphere above pressure $p$ at frequency $\nu$, and $B_\nu$ is Planck's function
given by:

$$B_\nu[T(p)] = c_1 \nu^3 / \{ \exp(c_2 \nu / T(p)) - 1 \}$$

$$c_1 = 1.19061 \times 10^{-5}\ \text{erg-cm}^2\text{-sec}^{-1} \qquad (1.2)$$

$$c_2 = 1.43868\ \text{cm-deg(K)}$$
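
As a quick numerical check of (1.2), Planck's function can be coded directly from the constants above. The sketch assumes that $\nu$ is expressed as a wavenumber in cm$^{-1}$ and $T$ in kelvin, which is what the units of $c_1$ and $c_2$ suggest but which the text does not state explicitly; the function name is hypothetical.

```python
import math

C1 = 1.19061e-5  # erg-cm^2-sec^-1, from (1.2)
C2 = 1.43868     # cm-deg(K), from (1.2)


def planck(nu, temp):
    """Planck's function B_nu[T] of (1.2).

    nu   : wavenumber in cm^-1 (an assumption about the units)
    temp : temperature in kelvin
    """
    # expm1(x) computes exp(x) - 1 accurately, also for small x.
    return C1 * nu ** 3 / math.expm1(C2 * nu / temp)
```

For example, `planck(700.0, 250.0)` is roughly 74 in the units implied by $c_1$, and the value increases with temperature at fixed $\nu$, as it should.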

The R.T.E.'s are of course an idealization. They describe the
intensities the satellite radiometer would record in the absence of such things
as atmospheric attenuation due to clouds or instrument noise. However,
by using high resolution radiometers like the HIRS or AVHRR, sets of
intensity measurements from many FOV's (fields of vision) can be combined
to obtain data of the form

$$z_i = \mathcal{R}_{\nu_i}(T) + \epsilon_i, \qquad i = 1, \ldots, n \qquad (1.3)$$

where the $\epsilon_i$'s are errors. These data relate to an area of about 119 by 140 km
on the earth's surface. See [6] for more details.

We are interested in refining the methods used to obtain temperature
distribution estimates from the above data. The procedure currently used
to process TIROS-N temperature sounding data is a linear regression
technique; see [6]. Here we begin to discuss how the method of
regularization (M.O.R.) might be used to improve the quality of temperature profiles
obtainable by this procedure.
Let $T$ be the true temperature profile in the atmosphere. Then $T$ can be
written as

$$T = T_0 + \delta \qquad (1.4)$$

where $T_0$ is the current best guess of $T$ and $\delta$ is the update or correction
to $T_0$ to be estimated from the data $\{z_i\}$ in hand. Using M.O.R. to estimate
$\delta$ involves consideration of a functional $I_\lambda$ given by

$$I_\lambda(\delta) = \sum_{i=1}^{n} [z_i - \mathcal{R}_{\nu_i}(T_0 + \delta)]^2 + \lambda \int_0^{p_0} [\delta^{(m)}(p)]^2\, dp \qquad (1.5)$$

and picking the estimated update $\delta_\lambda$ to minimize this functional (see footnote 1) over some
class of physically plausible candidates, for instance the set of functions
$\delta$ in $W_2^m[0, p_0]$ for which $T_0 + \delta$ is positive or perhaps, if the location
of the temperature inversion were reliably known, one would look for
minimizers of $I_\lambda$ subject to an additional constraint involving temperature
inversion.

The statistical reasoning for considering regularized estimates of this
type is well documented in the literature, see for example [3] and [1].
Intuitively $\delta_\lambda$ has been designed to match the observed data and possess
certain smoothness qualities. The parameter $\lambda$ controls a tradeoff between
the smoothness of a solution (measured by $\int_0^{p_0} [\delta^{(m)}(p)]^2\, dp$) and how well it
matches the data (the $\sum_{i=1}^{n} [z_i - \mathcal{R}_{\nu_i}(T_0 + \delta)]^2$ term).

[Footnote 1] This corresponds to the case when the measurement errors are iid $N(0, \sigma^2)$.
A more "robust" method would be to consider functionals of the form

$$I_\lambda(\delta) = \sum_{i=1}^{n} \rho[z_i - \mathcal{R}_{\nu_i}(T_0 + \delta)] + \lambda \int_0^{p_0} [\delta^{(m)}(p)]^2\, dp$$

where $\rho$ reflected the possible non-Gaussian nature of the noise.

Inverting the R.T.E.'s with noisy data can be viewed as a special case
of a more general situation in which the scientist wishes to estimate a
function $x$ given data

$$z_i = N_i(x) + \epsilon_i, \qquad i = 1, \ldots, n \qquad (1.6)$$

where $x$ is in some Hilbert space $H$, the $N_i$'s are non-linear functionals
and the $\epsilon_i$'s are noise. Here, assuming the $\epsilon_i$'s are iid $N(0, \sigma^2)$, an
appropriate regularization functional $I_\lambda$ is

$$I_\lambda(x) = \sum_{i=1}^{n} [z_i - N_i(x)]^2 + \lambda J(x) \qquad (1.7)$$

where $J$ is a roughness penalty functional on $H$. To estimate $x$ one proceeds
to minimize $I_\lambda$ over some subset of physical interest in $H$. This report
summarizes recent results we have obtained on the existence and numerical
approximability of minimizers of such $I_\lambda$'s in certain subsets of $H$. We
indicate how these results apply to the radiative transfer equations case.

There are three sections: section 2 talks about the existence theory;
a Gauss-Newton algorithm for minimizing the regularization functionals is
outlined in section 3, while the final section briefly describes how to
estimate the smoothing parameter using a first order approximation to the
generalized cross validation function given in [8]. We assume the reader
is familiar with the basic mathematical tools for discussing minimization
problems in Hilbert spaces. Part 1 of Ekeland and Temam's book [2] is an
inspiring introduction to this subject.


2. Existence Theory

Preliminaries

Before describing our main results, let's pause a moment to get our
notation straight. $H$ is a real Hilbert space with inner product $\langle \cdot, \cdot \rangle$ and
norm $\|\cdot\|$ (so $\langle x, x \rangle = \|x\|^2$). $P$ is a projection operator in $H$ with finite
dimensional null space; the complementary projection $I - P$ is denoted by $P_0$.
$H^*$ is the dual space of $H$, i.e. the space of all continuous linear maps
from $H$ into $R$. $L(H, H^*)$ is the space of linear operators from $H$ into $H^*$.

We will discuss functionals, $I$ say, acting on $H$ (so $I: H \to R$). The first and
second Frechet derivatives of $I$ at some point $x \in H$ will be denoted by $I'(x)$
and $I''(x)$ respectively. Think of $I'(x)$ as an element of $H^*$ and $I''(x)$ as
an element of $L(H, H^*)$. Our concern here is with regularization functionals
$I_\lambda$ on $H$ given by

$$I_\lambda(x) = \sum_{i=1}^{n} [z_i - N_i(x)]^2 + \lambda \|Px\|^2 \qquad (2.1)$$

where the $N_i$'s are functionals on $H$, the $z_i$'s are in $R$, $x \in H$ and $\lambda > 0$. Whenever we
write $I_\lambda$ the form (2.1) will be what is meant. So we are considering
regularization procedures in which the roughness penalty $J(x)$ is a semi-norm on
$H$ given by $J(x) = \|Px\|^2$.

Main  Results 

We now specify conditions on the non-linear functionals $N_i$ which
guarantee the existence of minimizers of $I_\lambda$ in closed convex subsets $K$ of
$H$. In the R.T.E. case a reasonable choice for $K$ is the set of all
functions $\delta$ in $W_2^m[0, p_0]$ for which $T_0 + \delta$ is positive. It is very easy to check
that this $K$ is a closed convex subset of $W_2^m[0, p_0]$ for any $m$. Our
existence results are summarized in the following three theorems.
Theorem 1 (proof in [2] pp. 34-35).

Let $K$ be a closed convex subset of a Hilbert space $H$. Suppose $I_\lambda: K \to R$ is
coercive on $K$ (i.e. $I_\lambda(x) \uparrow \infty$ as $\|x\| \to \infty$ in $K$) and moreover that $I_\lambda$ is weakly
lower semi-continuous (w.l.s.c.) on $K$; then $I_\lambda$ attains its infimum on $K$.

Theorem 2 (proof in [4])

Let $\phi: R \to R$ be a monotonic increasing function in the modulus of its
argument. Suppose

(i) $\sum_{i=1}^{n} \phi(N_i(x))$ is convex on $K$

n 

(ii)  l $[N.(x)]  = $<=>  P x = P e for  some  e In  K 
i=l  1 00 

then $I_\lambda$ is coercive on $K$.

Remark: The above theorem can be generalized somewhat but we refrain from
doing so because the form given has more intuitive appeal.

Theorem 3

If $N_i$ is weakly continuous (w.c.) on $K$ for each $i$ then $I_\lambda$ is w.l.s.c. on $K$.

Proof: If the $N_i$ are w.c., then it surely follows that $\sum_{i=1}^{n} [z_i - N_i(x)]^2$ is
w.c. But $\|Px\|^2$ is well known to be w.l.s.c. Therefore $I_\lambda$ is w.l.s.c.
Application to the R.T.E.'s (see [4] for details)

The $N_i$ arising here can be shown to satisfy the hypotheses of Theorem 2
with $\phi$ taken to be

$$\phi(x) = |x|, \qquad x \in R.$$

Also, each $N_i$ is w.c. We therefore have that for each
$\lambda > 0$, $\exists\, \delta_\lambda \in K = \{ \delta \in W_2^m[0, p_0] : T_0 + \delta > 0 \}$, s.t.

$$I_\lambda(\delta_\lambda) = \min_{\delta \in K} \Big\{ \sum_{i=1}^{n} [z_i - \mathcal{R}_{\nu_i}(T_0 + \delta)]^2 + \lambda \int_0^{p_0} [\delta^{(m)}(p)]^2\, dp \Big\}.$$

There exist regularized solutions to the R.T.E.'s.


3. A numerical procedure for minimizing $I_\lambda$ in $K$

Let $x^k$ be the kth approximation to the minimizer in $K$ of $I_\lambda$. Define
the functional $I_\lambda^k$ on $K$ as follows:

$$I_\lambda^k(x) = \sum_{i=1}^{n} \big[ z_i - N_i(x^k) - N_i'(x^k)[x - x^k] \big]^2 + \lambda \|Px\|^2 ; \qquad (3.1)$$

each $N_i$ is simply linearized about $x^k$. Define $x^{k+1}$ to be the minimizer in
$K$ of $I_\lambda^k$.

Under suitable regularity conditions the iterates $x^k$ are well defined
and can be shown to satisfy

$$x^{k+1} = x^k - \Big\{ \sum_{i=1}^{n} N_i'(x^k)\, N_i'(x^k) + \lambda \langle P\cdot, \cdot \rangle \Big\}^{-1} I_\lambda'(x^k)
= x^k - A^{-1}(x^k)\, I_\lambda'(x^k). \qquad (3.2)$$

That this equation makes good sense is evident once one realizes that $A(x^k)$
belongs to $L(H, H^*)$ and $I_\lambda'(x^k)$ is in $H^*$.

Those in the know will have recognized that the above procedure is
nothing more than an infinite dimensional version of the Gauss-Newton
algorithm. The finite dimensional case is discussed in [5]. The major
advantage of using a Gauss-Newton procedure to minimize our regularization
functionals is the ease with which successive iterates can be obtained.
At each stage we have a regularization problem involving linear
functionals, the $N_i'(x^k)$'s; consequently we can take advantage of available
software tools.

With the appropriate assumptions it is possible to show that the
procedure is a descent method and the sequence $x^k$ converges at least R-linearly
to a critical point of $I_\lambda$ in $K$.
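
A one-dimensional toy version may help fix ideas. In the sketch below $H$ is the real line, $P$ is the identity (so the penalty is $\lambda x^2$), $K$ is all of $H$, and the functionals are taken to be $N_i(x) = \exp(a_i x)$. All of these choices are assumptions made purely for illustration and are not the radiative transfer functionals. The step is the exact minimizer of the linearized functional (3.1); written in the notation of (3.2) it carries the factor 1/2 that comes from differentiating the squares, a normalization the text may absorb elsewhere.

```python
import math


def half_gradient(x, z, a, lam):
    """Half of I'(x) for I(x) = sum (z_i - exp(a_i x))^2 + lam * x^2."""
    return -sum(ai * math.exp(ai * x) * (zi - math.exp(ai * x))
                for zi, ai in zip(z, a)) + lam * x


def gauss_newton(z, a, lam, x0=0.0, steps=50):
    """Scalar Gauss-Newton iteration for the toy functional above."""
    x = x0
    for _ in range(steps):
        # N_i'(x) = a_i * exp(a_i x); A(x) = sum N_i'(x)^2 + lam
        a_op = sum((ai * math.exp(ai * x)) ** 2 for ai in a) + lam
        # Minimizer of the functional linearized about the current x.
        x -= half_gradient(x, z, a, lam) / a_op
    return x
```

With noise-free data generated at x = 0.3 and $\lambda = 0$ the iteration recovers 0.3; with $\lambda > 0$ it converges to a critical point that is pulled toward zero by the penalty.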

Theorem 4 (proof in [4]).

Suppose that the $N_i$'s are twice continuously differentiable and the
$N_i'(\cdot)$'s are w.c. on int $K$. Let $x^0 \in$ int $K$ be such that

$$L^0 = \{ x \mid I_\lambda(x) \le I_\lambda(x^0) \}$$

is weakly compact and $I_\lambda$ has only finitely many critical points in $L^0$.
Moreover, suppose that there exist $\mu_0$, $\mu_1$, $\gamma_1$, all positive, satisfying

$$\mu_0 \|h\|^2 \le \langle h, A(x)h \rangle \le \mu_1 \|h\|^2, \qquad I_\lambda''(x)hh \le \gamma_1 \|h\|^2 \qquad \forall x \in L^0, \; h \in H;$$

then the sequence of iterates $\{x^k\} \subset L^0$, $\lim_k x^k = x^*$ where $I_\lambda'(x^*) = 0$, and
if $I_\lambda''(x^*)$ is non-singular, then the convergence is at least R-linear.

The proof follows an argument similar to that used in 14.4.6 of [5].
4. The choice of $\lambda$

The generalized cross validation method for choosing $\lambda$ works as follows.
Let $x_\lambda^{[k]}$ be the minimizer in $K$ of

$$\sum_{i \ne k} [z_i - N_i(x)]^2 + \lambda \|Px\|^2 .$$

Then $\lambda$ is chosen to minimize

[display defining $V(\lambda)$ missing in the source]

where $N_k(x_\lambda^{[k]})$ is the prediction of $z_k$ given the data $z_1, z_2, \ldots, z_{k-1},
z_{k+1}, \ldots, z_n$ and $a_{kk}^*(\lambda)$ is the "differential influence" of the $k$'th data
point on the estimate $x_\lambda$ ($x_\lambda$ is the minimizer in $K$ of $I_\lambda$):

$$a_{kk}^*(\lambda) = \frac{N_k(x_\lambda^{[k]}) - N_k(x_\lambda)}{N_k(x_\lambda^{[k]}) - z_k} .$$

From a computational viewpoint $V(\lambda)$ is prohibitively expensive so one needs
to find some convenient approximation. Following Wahba [8], $V(\lambda)$ can be
approximated by


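[display defining $V_{\mathrm{approx}}(\lambda)$ illegible in the source; the legible fragments show a squared denominator of the form $[1 - \cdot\,(\lambda)]^2$ and a sum of derivatives with respect to the $z_i$]

[Footnote 2] Assumed to be uniquely defined.

The quantity appearing in the denominator is an easily computed functional
of the fitted solution. We hope to study this procedure
more closely in the near future.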
QUANTILES, PARAMETRIC-SELECT DENSITY ESTIMATION,
AND BI-INFORMATION PARAMETER ESTIMATORS

by

EMANUEL PARZEN
Institute of Statistics
Texas A&M University

Abstract

This paper outlines a quantile-based approach to statistical
analysis and probability modeling of data which formulates
statistical inference problems as functional inference problems
in which the parameters to be estimated are density functions.
Density estimators can be non-parametric (computed independently
of model identified) or parametric-select (approximated by finite
parametric models that can provide standard models whose fit
can be tested). Exponential models and autoregressive models
are approximating densities which can be justified as maximum
entropy for respectively the entropy of a probability density
and the entropy of a quantile density. Applications of these
ideas are outlined to the problems of modeling: (1) univariate
data; (2) bivariate data and tests for independence; and (3)
two samples and likelihood ratios. It is proposed that
bi-information estimation of a density function can be developed
by analogy to the problem of identification of regression models.
1. Statistical Science, data analysis, and Buffalo snowfall

Statisticians complain about the failure of universities
to adequately educate students on how to analyze statistical
data. At the same time some statisticians state that data
analysis is an art, and thus cannot be taught. When these
statisticians speak of statistical science it is difficult to
imagine to what they are alluding since they seem to
sneeringly reject all attempts to reason, and reach consensus,
about the evaluation of methods to be used as part of the process
of statistical data analysis.

I would like to propose a data set which I believe provides
a useful test case for various approaches to data analysis,
namely the annual time series of snowfall in Buffalo, N.Y. The
segment of that series which I will discuss is 1910-1972,
although it has many interesting features when extended to 1981.
The data analysis question to be considered is: What probability
distributions can be used to describe Buffalo snowfall? An
ever-present hypothesis to be considered is whether Buffalo
snowfall is normal.

2. Functions that describe probability distributions

The probability law of a continuous random variable $X$ can
be described by one or more of the following functions:

(1) Distribution Function $F(x) = \Pr[X \le x]$

(2) Probability Density Function $f(x) = F'(x)$

(3) Quantile Function $Q(u) = F^{-1}(u)$
    $= \inf\{x : F(x) \ge u\}$
    $= \inf\{x : F(x) = u\}$ if $F$ is continuous
    $= x$ such that $F(x) = u$ if $F$ is increasing at $x$

(4) Quantile-Density Function $q(u) = Q'(u)$

(5) Density-Quantile Function $fQ(u) = f(Q(u))$

Theorem: For $F$ continuous

$$FQ(u) = u , \qquad fQ(u)\, q(u) = 1 .$$

3. Raw functions that describe samples

Data $X_1, \ldots, X_n$ is called a random sample of $X$ when
$X_1, \ldots, X_n$ are independent random variables identically
distributed as $X$. An important role in the analysis of a sample
is played by the order statistics $X_{(1)} \le \cdots \le X_{(n)}$.

(1) Sample Distribution $\tilde F(x)$ = fraction of $X_1, \ldots, X_n \le x$
    $= j/n$, $X_{(j)} \le x < X_{(j+1)}$

(2) Sample Probability Density, or Histogram, estimates
$f(x)$ by a numerical derivative

f(X)  - 

(3) Sample Quantile $\tilde Q(u) = \tilde F^{-1}(u)$
    $= X_{(j)}$, $(j-1)/n < u \le j/n$

A universal display of any data set is provided by the quantile
box plot introduced in Parzen (1979).

(4) Sample Quantile-Density is a numerical derivative
$\tilde q(u) = \{ \tilde Q(u+h) - \tilde Q(u-h) \} / 2h$

(5) Sample Density-Quantile $\tilde f \tilde Q(u) = 1/\tilde q(u)$.
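
The raw functions are simple to compute. The sketch below implements the sample quantile as the inverse of the sample distribution function, i.e. the smallest order statistic $X_{(j)}$ with $j/n \ge u$, together with a symmetric-difference quantile-density. The symmetric difference with step h is an assumption about the form of the numerical derivative, and the function names are hypothetical.

```python
import math


def sample_quantile(data, u):
    """Smallest order statistic X_(j) with j/n >= u, for 0 < u <= 1."""
    xs = sorted(data)
    j = max(1, math.ceil(len(xs) * u))
    return xs[j - 1]


def sample_quantile_density(data, u, h):
    """Numerical derivative of the sample quantile function.
    Requires 0 < u - h and u + h <= 1."""
    return (sample_quantile(data, u + h) - sample_quantile(data, u - h)) / (2.0 * h)


def sample_density_quantile(data, u, h):
    """Reciprocal of the sample quantile-density; undefined when the two
    quantiles coincide (for example with tied observations)."""
    return 1.0 / sample_quantile_density(data, u, h)
```

For the five observations 1, 2, 3, 4, 5 the sample median is 3, and with h = 0.2 the quantile-density at u = 0.5 is (4 - 2)/0.4 = 5.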

An  important  formula  is 

* ^4rr>  ‘ 2 Un+ixxy^-xy.!,));1 

4.  Smooth  functions  that  describe  samples  and  estimate 
probability  distributions 

The functions F, f, Q, q, fQ that represent the true probability distribution of a random variable X are estimated by smooth functions $\hat F, \hat f, \hat Q, \hat q, \hat{fQ}$ which are derived from the raw descriptive functions $\tilde F, \tilde f, \tilde Q, \tilde q, \tilde{fQ}$. One distinguishes between parametric and non-parametric methods of estimating smooth functions.

A parametric estimation method: (1) assumes a family $F_\theta, f_\theta, Q_\theta, q_\theta, f_\theta Q_\theta$ of functions, called parametric models, which are indexed by a parameter $\theta = (\theta_1, \ldots, \theta_k)$; (2) forms estimators $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k)$ of $\theta$; (3) forms smooth functions by

$\hat F(x) = F_{\hat\theta}(x)$,  $\hat f(x) = f_{\hat\theta}(x)$,

$\hat Q(u) = Q_{\hat\theta}(u)$,  $\hat q(u) = q_{\hat\theta}(u)$,

$\hat{fQ}(u) = f_{\hat\theta} Q_{\hat\theta}(u)$.

A non-parametric estimation method forms estimators which are not based on parametric models. Important examples of non-parametric estimators of a probability density f(x) and a quantile-density q(u) are respectively

$\hat f(x) = \frac{1}{h} \int_{-\infty}^{\infty} K\left(\frac{x-y}{h}\right) d\tilde F(y)$

$\hat q(u) = \frac{1}{h} \int_0^1 K\left(\frac{u-t}{h}\right) d\tilde Q(t)$

for suitable kernels $K(\cdot)$ and bandwidth h.
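Because the sample distribution function puts mass 1/n at each observation, the kernel integral for the density reduces to an average of kernel values. A minimal sketch follows, assuming a Gaussian kernel and a hand-picked bandwidth; both choices and all names are illustrative, not prescribed by the text.

```python
import math

def gaussian_kernel(z):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kernel_density(x, sample, h):
    """(1/h) * integral of K((x - y)/h) against the sample distribution,
    i.e. the average of K((x - X_i)/h) / h over the sample."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)

sample = [0.0, 1.0, 2.5]
h = 0.5

# Riemann-sum check that the estimate integrates to about 1.
step = 0.01
total = sum(kernel_density(-5.0 + step * k, sample, h) for k in range(1301)) * step
print(total)
```

A smaller bandwidth follows the data more closely at the price of a rougher estimate, which is the usual trade-off in choosing h.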

5.  Parameter estimation and information divergence

When a parametric model $f_\theta$ is assumed, parameter estimators $\hat\theta$ are often determined by minimizing a "distance" between $\tilde f(x)$ and $f_\theta(x)$. A "distance" between two probability densities f(x) and g(x) is denoted I(f;g) and is called an information divergence between f(x) and g(x). It is usually not symmetric in f and g, and it does not satisfy the triangle inequality for a metric. But it does satisfy $I(f;g) \ge 0$, and $I(f;g) = 0$ if and only if f = g.

The most famous, and most important, definition of information divergence is

$I_1(f;g) = \int_{-\infty}^{\infty} \left\{-\log \frac{g(x)}{f(x)}\right\} f(x)\,dx$

called the information divergence of order 1, or Kullback-Leibler information divergence. Information divergence of order $\alpha$ is defined for $\alpha > 0$ (but $\alpha \ne 1$) by

$I_\alpha(f;g) = \frac{1}{\alpha - 1} \log \int_{-\infty}^{\infty} \left\{\frac{g(x)}{f(x)}\right\}^{1-\alpha} f(x)\,dx.$

The most important values of $\alpha$ are $0.5 < \alpha \le 2$.

Bi-information divergence is defined by

$II(f;g) = \int_{-\infty}^{\infty} \left|\log \frac{g(x)}{f(x)}\right|^2 f(x)\,dx;$

it may be regarded as related to $I_2(g;f)$.

Information divergence of order 1 can be written

$I_1(f;g) = H(f;g) - H(f)$

defining

$H(f;g) = \int_{-\infty}^{\infty} \{-\log g(x)\}\, f(x)\,dx,$

$H(f) = H(f;f) = \int_{-\infty}^{\infty} \{-\log f(x)\}\, f(x)\,dx.$

We call H(f;g) the cross-entropy of f and g, and call H(f) the entropy of f.
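The decomposition of the order 1 divergence into cross-entropy minus entropy can be verified numerically. The sketch below is an illustration under assumptions made here: f is taken as Normal(0, 1), g as Normal(1, 2), and the integrals are approximated by a midpoint rule on a wide finite interval. The closed form used for comparison is the standard Kullback-Leibler divergence between two normal densities.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(func, a, b, steps=20000):
    """Midpoint rule on [a, b]."""
    width = (b - a) / steps
    return width * sum(func(a + (k + 0.5) * width) for k in range(steps))

f = lambda x: normal_pdf(x, 0.0, 1.0)   # "true" density (assumed example)
g = lambda x: normal_pdf(x, 1.0, 2.0)   # approximating density (assumed example)

cross_entropy = integrate(lambda x: -math.log(g(x)) * f(x), -12.0, 12.0)   # H(f;g)
entropy = integrate(lambda x: -math.log(f(x)) * f(x), -12.0, 12.0)         # H(f)
divergence = integrate(lambda x: -math.log(g(x) / f(x)) * f(x), -12.0, 12.0)

# Closed form for two normals: log(s_g/s_f) + (s_f^2 + (m_f - m_g)^2)/(2 s_g^2) - 1/2
closed_form = math.log(2.0) + (1.0 + 1.0) / 8.0 - 0.5
print(divergence, cross_entropy - entropy, closed_form)
```

Swapping the roles of f and g gives a different value, which illustrates the lack of symmetry noted in the text.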

Maximum likelihood parameter estimation can be shown to be equivalent to minimum cross-entropy estimation. The likelihood function of a parametric model $f_\theta$ is defined by

$L(f_\theta) = \log f_\theta(X_1, \ldots, X_n) = \sum_{t=1}^{n} \log f_\theta(X_t).$

One may verify that

$L(f_\theta) = n \int_{-\infty}^{\infty} \log f_\theta(x)\, d\tilde F(x) = -n\, H(\tilde f; f_\theta).$

The maximum likelihood parameter estimator $\hat\theta$, defined by

$L(f_{\hat\theta}) = \max_\theta L(f_\theta),$

clearly satisfies

$H(\tilde f; f_{\hat\theta}) = \min_\theta H(\tilde f; f_\theta).$

It also satisfies

$I_1(\tilde f; f_{\hat\theta}) = \min_\theta I_1(\tilde f; f_\theta).$

In general parameter estimators $\hat\theta$ are found by minimizing $I(\tilde f; f_\theta)$ or $I(f_\theta; \tilde f)$. Chi-squared estimators minimize $I_2(f_\theta; \tilde f)$ while modified chi-squared estimators minimize $I_2(\tilde f; f_\theta)$.


To compute $I_1(\tilde f; f_{\hat\theta})$ one needs to compute $H(\tilde f)$. A useful formula for accomplishing this is

$H(f) = \int_{-\infty}^{\infty} \{-\log f(x)\}\, dF(x) = \int_0^1 \{-\log fQ(u)\}\, du = \int_0^1 \log q(u)\, du.$

The value of $I_1(\tilde f; f_{\hat\theta})$ can be used to test the goodness of fit of the parametric model $f_\theta$.


6.  Information and bi-information parameter estimation, and comparison distribution functions

Given a sample with sample probability density function $\tilde f$ and parametric model $f_\theta$, one can form diverse parameter estimators, denoted $\hat\theta$ and $\check\theta$, corresponding to two choices of information divergence which we take to be: (1) $I_1(\tilde f; f_\theta)$, and (2) $I_2(f_\theta; \tilde f)$ or $II(\tilde f; f_\theta)$. We call $\hat\theta$ and $\check\theta$ diverse parameter estimators. For greater precision we call $\hat\theta$ the (order 1) information estimator, and $\check\theta$ the bi-information estimator.

When the parametric model $f_\theta$ is exact, the diverse parameter estimators have equivalent statistical properties; they are both asymptotically efficient estimators, and are not significantly different from each other.

When the values of $\hat\theta$ and $\check\theta$ computed from a sample are significantly different, one should suspect that the parametric model $f_\theta$ does not fit the data. The Shapiro-Wilk statistics for testing normality and exponentiality can be regarded as comparing diverse estimators which minimize information of order 1 and 2 respectively.

One can interpret $\hat\theta$ and $\check\theta$ as parameter values of "best approximating" models.

One wishes to evaluate $F_{\hat\theta}(x)$ and $F_{\check\theta}(x)$ as smooth estimators of F(x). For any parameter value $\theta$, define

$\tilde D_\theta(u) = F_\theta(\tilde Q(u))$

which is the sample quantile function of the transformed random variables

$U_1 = F_\theta(X_1), \ldots, U_n = F_\theta(X_n).$

The true parameter value $\theta$ has the property that $U_1, \ldots, U_n$ are distributed with a uniform [0,1] distribution. Then parameter estimators $\hat\theta$ and $\check\theta$ are compared by the character of the closeness to the identity function D(u) = u of $\tilde D_{\hat\theta}(u)$ and $\tilde D_{\check\theta}(u)$.

We call $\tilde D_\theta(u)$ a comparison distribution function. Its derivative

$\tilde d_\theta(u) = \{\tilde D_\theta(u)\}'$

plays a basic role and is called a comparison density; formulas for the comparison density are

$\tilde d_\theta(u) = f_\theta(\tilde Q(u))\, \tilde q(u) = \frac{f_\theta(\tilde Q(u))}{\tilde{fQ}(u)}.$

An alternative comparison density, introduced in Parzen (1979), is

$\tilde d(u) = f_0 Q_0(u)\, \tilde q(u) / \tilde\sigma_0,$

$\tilde\sigma_0 = \int_0^1 f_0 Q_0(u)\, \tilde q(u)\, du,$

$\tilde D(u) = \int_0^u \tilde d(t)\, dt$

where $f_0 Q_0(u)$ is a specified density-quantile function.

Parameter estimators can be justified as minimizing information divergence

$I_1(\tilde d_\theta) = \int_0^1 -\log \tilde d_\theta(u)\, du = I_1(\tilde f; f_\theta)$

$II(\tilde d_\theta) = \int_0^1 |\log \tilde d_\theta(u)|^2\, du = II(\tilde f; f_\theta)$

$I_\alpha(\tilde d_\theta) = \frac{1}{\alpha - 1} \log \int_0^1 \{\tilde d_\theta(u)\}^{1-\alpha}\, du$

$\int_0^1 |\tilde d_\theta(u) - 1|^2\, du = \int_0^1 |\tilde d_\theta(u)|^2\, du - 1$

These measure the closeness to 1 of $\tilde d_\theta(u)$, or the closeness to D(u) = u of $\tilde D_\theta(u)$. However the final decision about parameter estimators should be based on visual inspection of the graph of $\tilde D_\theta(u)$.
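The comparison distribution function can be tabulated by transforming the sample with a candidate model distribution function and sorting the result. The sketch below is an illustration only: the exponential model, the simulated sample, the plotting positions j/(n+1), and the maximum-deviation summary are all assumptions made here, and the text itself recommends inspecting the graph rather than relying on a single number.

```python
import math
import random

def comparison_distribution(sample, model_cdf):
    """Sorted values of U_j = F_theta(X_j); the j-th value tabulates the
    comparison distribution function near u = j/(n+1)."""
    return sorted(model_cdf(x) for x in sample)

def max_deviation_from_identity(u_sorted):
    """Largest gap between the tabulated function and D(u) = u."""
    n = len(u_sorted)
    return max(abs(u - (j + 1) / (n + 1)) for j, u in enumerate(u_sorted))

def exponential_cdf(scale):
    return lambda x: 1.0 - math.exp(-x / scale)

rng = random.Random(1)
# Simulated exponential sample with scale 1 (inverse-transform sampling).
sample = [-math.log(1.0 - rng.random()) for _ in range(500)]

dev_true = max_deviation_from_identity(
    comparison_distribution(sample, exponential_cdf(1.0)))
dev_wrong = max_deviation_from_identity(
    comparison_distribution(sample, exponential_cdf(3.0)))
print(dev_true, dev_wrong)
```

With the correct scale the tabulated function stays close to the diagonal, while the misspecified scale bends visibly away from it.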


Another consequence of considering information of order $\alpha$ is that we can unify the estimation criterion used to form maximum likelihood estimators with the estimation criterion used to form Gaussian time series parameter estimators:

$I_{sp}(\tilde f; f_\theta) = \log \int_0^1 \frac{\tilde f(\omega)}{f_\theta(\omega)}\, d\omega,$

where $\tilde f$ and $f_\theta$ are spectral densities. It is comparable to

$I_2(\tilde d_\theta) = \log \int_0^1 \frac{\tilde{fQ}(u)}{f_\theta \tilde Q(u)}\, du.$
7.  Statistical Inference reduced to density estimation

The quantile approach to statistical data analysis being developed by Parzen [since Parzen (1979)] is based on the proposition that conventional problems of statistical inference concerning (1) a random sample $X_1, \ldots, X_n$, (2) a bivariate sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, or (3) two samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ should be transformed to problems of functional inference, estimating and testing hypotheses about density functions $d(u)$, $d(u_1, u_2)$, ..., $d(u_1, \ldots, u_k)$, on the unit interval $0 < u < 1$, unit square $0 < u_1, u_2 < 1$, unit hypercube $0 < u_1, \ldots, u_k < 1$. To illustrate how this is done consider the following problems.

Modeling Bivariate Data and Tests for Independence. Let X and Y be continuous random variables with joint density function $f_{X,Y}(x,y)$. The hypothesis $H_0$: X and Y are independent can be expressed

$H_0$:  $f_{X,Y}(x,y) = f_X(x)\, f_Y(y)$

or in terms of information divergence

$I(f_{X,Y}; f_X f_Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left\{-\log \frac{f_X(x) f_Y(y)}{f_{X,Y}(x,y)}\right\} f_{X,Y}(x,y)\, dx\, dy$

by

$H_0$:  $I(f_{X,Y}; f_X f_Y) = 0.$

Define

$D(u_1, u_2) = F_{X,Y}(Q_X(u_1), Q_Y(u_2)),$

$d(u_1, u_2) = \frac{\partial^2}{\partial u_1\, \partial u_2} D(u_1, u_2) = \frac{f_{X,Y}(Q_X(u_1), Q_Y(u_2))}{f_X Q_X(u_1)\, f_Y Q_Y(u_2)}.$

We call $d(u_1, u_2)$ the quantile dependence density.

The hypothesis $H_0$ can be expressed

$H_0$:  $D(u_1, u_2) = u_1 u_2$,  $d(u_1, u_2) = 1.$

One can verify that

$I_1(f_{X,Y}; f_X f_Y) = \int_0^1 \int_0^1 \{\log d(u_1, u_2)\}\, d(u_1, u_2)\, du_1\, du_2 = -H(d(u_1, u_2)).$

Thus estimating the information divergence between $f_{X,Y}$ and $f_X f_Y$ is equivalent to estimating the negative of the entropy of $d(u_1, u_2)$.

Estimators $\hat d(u_1, u_2)$ dependent on a finite number of parameters can be formed from the raw estimator

$\tilde D(u_1, u_2) = \tilde F_{X,Y}(\tilde Q_X(u_1), \tilde Q_Y(u_2)).$

Modeling likelihood ratios and testing equality of distributions. Let X and Y be continuous random variables.

The hypothesis

$H_0$:  $F_X(x) = F_Y(x)$,  or  $f_X(x) = f_Y(x)$

can be expressed in terms of information divergence

$I(f_Y; f_X) = \int_{-\infty}^{\infty} -\log \frac{f_X(y)}{f_Y(y)}\, dF_Y(y) = \int_0^1 -\log d(u)\, du = -H_{qd}(d(u))$

defining the comparison distribution function and comparison density function

$D(u) = F_X Q_Y(u)$,  $d(u) = \frac{d}{du} D(u) = \frac{f_X(Q_Y(u))}{f_Y(Q_Y(u))}.$

Estimating the information divergence between $f_Y$ and $f_X$ is equivalent to estimating the negative of the entropy in the quantile-density sense of the comparison density d(u).

8.  Parametric-select density estimation and Maximum Entropy Densities

A density $d(u) = D'(u)$ can be approximated in many ways by sequences $d_m(u)$, m = 1, 2, ..., of functions which converge to d(u). For m = 1, 2, ..., let $\hat d_m(u)$ be an estimator of $d_m(u)$; the sequence $\hat d_m(u)$ then estimates d(u).

If $d_m(u)$ corresponds to a standard finite parametric model for d(u), for which one could consider testing the hypothesis that $d_m(u)$ provides an exact model, we call $d_m(u)$ a parametric-select representation, and $\hat d_m(u)$ a parametric-select estimator, to indicate that we are free to select the number of parameters in $d_m(u)$ to provide an adequate approximation or representation of d(u).

We call $d_m(u)$ a non-parametric representation, and $\hat d_m(u)$ a non-parametric estimator, if $d_m(u)$ does not correspond to a standard finite parameter model which could be interpreted as an exact model.

An important criterion for developing the functional form of exact models for densities is the maximum entropy principle.

A density f(x), $-\infty < x < \infty$, which maximizes entropy

$H(f) = \int_{-\infty}^{\infty} \{-\log f(x)\}\, f(x)\, dx$

subject to constraints

$\int_{-\infty}^{\infty} T_j(x)\, f(x)\, dx = \tau_j$,  j = 1, ..., k,

where $T_j(x)$ are specified functions (called sufficient statistics) and $\tau_j$ are specified moments, can be shown to have the representation, called an exponential model,

$\log f(x) = \sum_{j=1}^{k} \theta_j T_j(x) - \psi(\theta_1, \ldots, \theta_k)$

where

$\psi(\theta_1, \ldots, \theta_k) = \log \int_{-\infty}^{\infty} \exp\left\{\sum_{j=1}^{k} \theta_j T_j(x)\right\} dx$

guarantees that f(x) integrates to 1.
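To make the role of the normalizing term concrete, the sketch below builds an exponential model numerically from a list of sufficient statistics and natural parameters. It is an illustration under assumptions made here: the statistics are x and x squared, the parameters are chosen so that the model should coincide with a normal density with mean 1 and standard deviation 2, the name `psi` stands for the log normalizing constant, and the integral is approximated by a midpoint rule on a wide finite interval.

```python
import math

def integrate(func, a, b, steps=20000):
    """Midpoint rule on [a, b]."""
    width = (b - a) / steps
    return width * sum(func(a + (k + 0.5) * width) for k in range(steps))

def exponential_model(theta, stats, a, b):
    """Return (density, psi) with log density = sum_j theta_j T_j(x) - psi."""
    def exponent(x):
        return sum(t * T(x) for t, T in zip(theta, stats))
    psi = math.log(integrate(lambda x: math.exp(exponent(x)), a, b))
    return (lambda x: math.exp(exponent(x) - psi)), psi

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

stats = [lambda x: x, lambda x: x * x]               # T_1(x) = x, T_2(x) = x^2
mu, sigma = 1.0, 2.0
theta = [mu / sigma ** 2, -1.0 / (2.0 * sigma ** 2)]  # natural parameters of the normal

density, psi = exponential_model(theta, stats, -30.0, 30.0)
mass = integrate(density, -30.0, 30.0)
first_moment = integrate(lambda x: x * density(x), -30.0, 30.0)
print(psi, mass, first_moment, density(0.3), normal_pdf(0.3, mu, sigma))
```

Adding further statistics to the list, such as the cube of x, yields a richer exponential model through the same construction, provided the integral stays finite.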

A quantile-density function q(u), $0 < u < 1$, which maximizes entropy

$H_{qd}(q) = \int_0^1 \log q(u)\, du$

subject to the constraints

$\frac{\int_0^1 \exp(2\pi i u v)\, f_0 Q_0(u)\, q(u)\, du}{\int_0^1 f_0 Q_0(u)\, q(u)\, du} = \rho(v)$,  $v = 0, \pm 1, \ldots, \pm m$,

where $f_0 Q_0(u)$ is a specified density-quantile function, must have the representation, called an autoregressive model,

$q(u) = q_0(u)\, \sigma_m^2\, \left|1 + \alpha_m(1) e^{2\pi i u} + \cdots + \alpha_m(m) e^{2\pi i u m}\right|^{-2}$

9.  Exact-Parametric and Parameter-select Estimation of Probability Density Functions using Exponential Models

Two important exponential models for a density f(x), $-\infty < x < \infty$, are the normal density and the gamma density.

The normal density, denoted Normal $(\mu, \sigma)$,

$\phi_{\mu,\sigma}(x) = \frac{1}{\sigma}\, \phi\left(\frac{x - \mu}{\sigma}\right)$,

$\phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x^2\right)$,

is exponential with sufficient statistics $T_1(x) = x$ and $T_2(x) = x^2$.

The Gamma density, denoted Gamma $(r, \lambda)$ where $\lambda = 1/\sigma$,

$f_{r,\sigma}(x) = \frac{1}{\sigma}\, f_r\left(\frac{x}{\sigma}\right)$,

$f_r(x) = \frac{1}{\Gamma(r)}\, x^{r-1} e^{-x}$,  $x > 0$,

$= 0$,  $x < 0$,

is exponential with sufficient statistics $T_1(x) = x$ and $T_2(x) = \log x$.

A location scale parameter Gamma density

$f_{r,\mu,\sigma}(x) = \frac{1}{\sigma}\, f_r\left(\frac{x - \mu}{\sigma}\right)$

is not an exponential model. We can treat it as one by estimating $\mu$ (say, by the minimum $X_{(1)}$ of the random sample $X_1, \ldots, X_n$), and treating $X_j - \hat\mu$ as a sample from $f_{r,\sigma}(x)$.

The hypothesis that the data is fit by a normal distribution versus the hypothesis that the data is fit by a Gamma distribution can be tested by forming an over-parametrized exponential model with sufficient statistics

$T_1(x) = x$,  $T_2(x) = x^2$,  $T_3(x) = x^3$,  $T_4(x) = \log x$.


The (order 1) information divergence, or maximum likelihood, estimators $\hat\theta_1, \hat\theta_2, \hat\theta_3, \hat\theta_4$, which minimize information divergence of order 1, $\int_0^1 -\log \tilde d_\theta(u)\, du$, may be found for an exponential model by solving

$\tilde\tau_j = E_\theta[T_j]$

where $\tau_j = E_\theta[T_j]$ is estimated by

$\tilde\tau_j = \frac{1}{n} \sum_{t=1}^{n} T_j(X_t).$

The bi-information divergence estimators $\check\theta_1, \ldots, \check\theta_k$, which minimize information divergence $\int_0^1 |\log \tilde d_\theta(u)|^2\, du$, may be found using least squares regression analysis techniques by minimizing with respect to $\theta_1, \ldots, \theta_k$ the sum of squares

$\sum_{j=2}^{n-1} \left| \log \tilde f(X_{(j)}) - \overline{\log \tilde f} - \theta_1 \left(T_1(X_{(j)}) - \bar T_1\right) - \cdots - \theta_k \left(T_k(X_{(j)}) - \bar T_k\right) \right|^2$

where the bars denote averages over j = 2, ..., n-1.

Stepwise regression is used to suggest parsimonious parametrizations.

Graphical procedures to determine which parameter values fit best are as follows: estimate $\tilde D_\theta(j/(n+1))$, j = 2, ..., n-1, by adding

$\tilde d_\theta\left(\frac{j}{n+1}\right) = f_\theta(X_{(j)}) \div \tilde f(X_{(j)})$

and normalizing the sum to go from 0 to 1. One inspects its graph to see how it deviates from D(u) = u.
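The least squares route to bi-information estimators can be sketched for a two-statistic model. Everything in the block below is an assumption made for illustration: the sufficient statistics are taken as x and log x (a Gamma-type model), the raw density at the order statistic $X_{(j)}$ is taken from the spacing formula $2/\{(n+1)(X_{(j+1)} - X_{(j-1)})\}$, the regression is centered so the normalizing constant drops out, and the data are simulated from Gamma(r = 3, scale 1), for which the coefficients should be near -1 and r - 1 = 2.

```python
import math
import random

def raw_log_density(xs_sorted):
    """log of the spacing-based raw density at X_(j), j = 2..n-1 (1-based)."""
    n = len(xs_sorted)
    points, logs = [], []
    for i in range(1, n - 1):                       # 0-based position of X_(j)
        gap = xs_sorted[i + 1] - xs_sorted[i - 1]   # X_(j+1) - X_(j-1)
        points.append(xs_sorted[i])
        logs.append(math.log(2.0 / ((n + 1) * gap)))
    return points, logs

def centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def bi_information_fit(xs_sorted):
    """Centered least squares of log raw density on T_1 = x and T_2 = log x."""
    points, logs = raw_log_density(xs_sorted)
    y = centered(logs)
    t1 = centered(points)
    t2 = centered([math.log(x) for x in points])
    s11 = sum(a * a for a in t1)
    s22 = sum(b * b for b in t2)
    s12 = sum(a * b for a, b in zip(t1, t2))
    s1y = sum(a * c for a, c in zip(t1, y))
    s2y = sum(b * c for b, c in zip(t2, y))
    det = s11 * s22 - s12 * s12                     # 2x2 normal equations
    theta1 = (s1y * s22 - s2y * s12) / det
    theta2 = (s2y * s11 - s1y * s12) / det
    return theta1, theta2

rng = random.Random(0)
sample = sorted(rng.gammavariate(3.0, 1.0) for _ in range(2000))
theta1, theta2 = bi_information_fit(sample)
print(theta1, theta2)
```

The raw spacing estimate is noisy, so the fitted coefficients only approximate the generating values; a smoother raw density or a larger sample tightens the fit.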

10.  Case  studies  of  bi-information  density  estimation 

The  density  estimators  corresponding  to  the  bi-information 
parameter  estimates  of  the  normal,  gamma,  and  four-parameter 
exponential  models  are  presented  for  four  simulated  random 
samples : 

1)  Exponential or Gamma (r = 1, σ = 1)

2)  Gamma (r = 10, σ = 1)

3)  Normal (μ = 0, σ = 1)

4)  Contaminated normal: 100 N(0,1), 5 N(10,1)

In addition density estimators, using bi-information parameters, are presented for the data set of Buffalo snowfall. Bi-information select regression estimation of the parameters of a 4-parameter exponential model with sufficient statistics $x$, $x^2$, $x^3$, and $\log x$ leads to the conclusion that Buffalo snowfall obeys a Gamma distribution. It is equally well fit by a normal distribution whose parameters are estimated by minimizing bi-information rather than order 1 information.

The hypothesis that Buffalo snowfall is normal seems to be acceptable, but one can question whether the maximum likelihood estimators (sample mean and variance) provide the best-fitting normal distribution for Buffalo snowfall.

As in Parzen (1979), we reject a trimodal shape probability density estimate for Buffalo snowfall, which has been found by several non-parametric density estimation techniques, including Tapia and Thompson (1978).
[Figure: D(u) for bi-inf-div Gamma (1.56, .53)]

[Figure title, truncated in source: Parsimonious 4-parameter exponential density has]
Abstract

This paper discusses the strong consistency, asymptotic normality, and asymptotic efficiency of maximum likelihood estimates of the parameters in a finite mixture of multivariate distributions, as well as the asymptotic theory of some hypothesis tests for such mixtures.
1.  Introduction

The use of multivariate mixture analysis techniques for unsupervised classification of large amounts of data has been feasible at least since it was proposed and implemented by J.H. Wolfe in 1970 [34]. Prior to that time estimation of parameters in a finite mixture of unknown component distributions had largely been confined to mixtures of a small number of univariate distributions, primarily because of the numerical difficulties and computational requirements of parameter estimation in larger mixture models. A variety of estimators for mixture parameters has been suggested, including moment estimators (Pearson [22] and Rao [26]), graphical methods (Blischke [4], Bhattacharya [3], Cassie [6] and Harding [15]) and least squares and minimum chi-square estimators.

However, recent attention has been focused on maximum likelihood estimation (Day [9], Hasselblad [17], Dick [10], Peters and Coberly [24], and Peters and Walker [25]) and on nonparametric methods (Murray and Titterington [21], and Hall [14]).

As shown in the next section the likelihood equations for mixture parameters are not explicitly solvable and require the use of iterative methods of solution. Because there may be multiple roots of the likelihood equations, one is concerned that the iterative method chosen converge to the "right" solution, i.e., a consistent solution if one exists. This issue is discussed by Kale [20], and also by Peters and Walker [25].

For mixtures with a known number of components, the asymptotic theory is established rather easily using appropriate generalizations of the combined results of Cramer and Huzurbazar in the single parameter case [8], [19]. For mixtures with an unknown number of components the problem is more difficult and, in particular, the large sample theory of tests for hypotheses about the number of components has not been worked out satisfactorily. These issues will be discussed in more detail in Section 4.
The use of multivariate mixture methods in the analysis of remotely sensed data is, for the most part, as an alternative to clustering. Used in this fashion, the method is superior to most clustering methods in ease of implementation (with certain reservations) and in economy of output: it tells the investigator only the most important facts about the data distribution. Thus, the usefulness of mixture density estimation in this mode depends solely on the reality of some prior classification of the data into subpopulations accurately described by the given parametric family of component distributions. However, there is a growing tendency to use large sample considerations, with samples drawn from a multivariate mixture density, as a standard for judging clustering methods [11]. By this standard then, provided the expected consistency, normality, and efficiency properties hold, the maximum likelihood estimate of mixture parameters is the ideal alternative to clustering.

2.  The  Basic  Likelihood  Functions. 

Let X be a random n-vector which is distributed according to a finite mixture density of the form

(2.1)  $f(x \mid \alpha, \theta) = \sum_{i=1}^{m} \alpha_i f_i(x \mid \theta_i)$

where the mixing proportions $\alpha_i > 0$ are unknown parameters satisfying

(2.2)  $\sum_{i=1}^{m} \alpha_i = 1$

and the $f_i(x \mid \theta_i)$ are distinct members of parametric families $\{f_i(x \mid \theta_i) \mid \theta_i \in \Theta_i\}$ of density functions. For the remainder of this section, we assume that m, the number of components in the mixture, is known, and that the densities $f_i(x \mid \theta_i)$ come from exponential families

(2.3)  $f_i(x \mid \theta_i) = c_i(\theta_i)\, h_i(x)\, \exp[q_i(\theta_i) \cdot T_i(x)]$

where $\theta_i = E_{\theta_i}[T_i(X)]$ is the mean value parametrization and $\Theta_i$ is an open subset of $\mathbb{R}^{n_i}$ [2]. Our aim is to investigate the consistency of roots of likelihood equations for the parameters $(\alpha, \theta)$, where $\alpha = (\alpha_1, \ldots, \alpha_m)$ and $\theta = (\theta_1, \ldots, \theta_m) \in \Theta_1 \times \cdots \times \Theta_m$, for various types of samples. Mixture densities arise most naturally when it is known that X comes from one of m populations $P_1, \ldots, P_m$ and that the density of X given that it comes from $P_i$ is of the form $f_i(x \mid \theta_i)$. If $\Pi \in \{1, \ldots, m\}$ is the associated random variable which designates the population of origin, then $\alpha_i = \mathrm{Prob}[\Pi = i]$. The r.v. $\Pi$ is usually unobserved.

Independent unlabelled samples: $(X_1, \Pi_1), \ldots, (X_N, \Pi_N)$ are independent and identically distributed according to (2.1) and the $\Pi_j$ are unknown. The corresponding log likelihood function is

(2.3)  $L_1(\alpha, \theta) = \sum_{j=1}^{N} \log f(X_j \mid \alpha, \theta)$

Partially labelled samples: Here we consider two sample types, both introduced by Hosmer [18], and studied in detail by him and Walker [33]; see also Redner [29].

Type 1 - Fixed numbers $M_1, \ldots, M_m$ of samples are taken independently of one another from each of the component populations $P_1, \ldots, P_m$. Let $\{X_{ij}\}_{j=1}^{M_i}$ be the sample from $P_i$. In addition a random sample $X_1, \ldots, X_N$ is taken from the mixture (2.1). The log likelihood is

(2.4)  $L_2(\alpha, \theta) = \sum_{i=1}^{m} \sum_{j=1}^{M_i} \log f_i(X_{ij} \mid \theta_i) + L_1(\alpha, \theta)$

Type 2 - After a random sample $X_1, \ldots, X_{N+M}$ of size N + M is taken from (2.1), the originating populations of $X_{N+1}, \ldots, X_{N+M}$ are determined (with no error) and it is found that $M_i$ of $X_{N+1}, \ldots, X_{N+M}$ come from $P_i$, i = 1, ..., m. The log likelihood is

(2.5)  $L_3(\alpha, \theta) = \log \left[ \frac{M!}{M_1! \cdots M_m!}\, \alpha_1^{M_1} \cdots \alpha_m^{M_m} \right] + L_2(\alpha, \theta).$

In this expression $L_2(\alpha, \theta)$ has the same form as in (2.4), although $M_1, \ldots, M_m$ are random.


Samples in blocks: For making inferences about the agricultural makeup of ground areas from satellite data, certain procedures have been designed which automatically delineate sets of geographically contiguous measurements which come from the same population (Bryant [5]). Thus, the data is obtained in blocks $X_j = \{X_{j1}, \ldots, X_{jN_j}\}$, j = 1, ..., p, where the corresponding $\Pi_{j1}, \ldots, \Pi_{jN_j}$ have a common value $\Pi_j$. Various kinds of dependence can be assumed within each block leading to different likelihood functions of the form

(2.6)  $L_4(\alpha, \theta) = \sum_{j=1}^{p} \log \sum_{i=1}^{m} \alpha_i f_{ij}(X_j \mid \theta_i)$

where $f_{ij}(X_j \mid \theta_i)$ is the joint density of $X_{j1}, \ldots, X_{jN_j}$ given that $\Pi_j = i$. In deriving (2.6) it is assumed that the size of the block $N_j$ is independent of $\Pi_j$, which may require careful stratification of blocks by size. Finally, we remark that, in applications, samples of each type are frequently degraded by missing components in the data vectors. In this case, a likelihood function like (2.6) is appropriate provided the pattern of missing components is independent of both the population of origin and the full data vector. The $X_j$ in (2.6) become the vectors of observed components. Note that not all of the scalar components of $\theta_i$ are necessarily identifiable in the density $f_{ij}(X_j \mid \theta_i)$.

The simplest model (2.3) well illustrates the complications of maximum likelihood estimation. After introducing the appropriate Lagrange multipliers and setting the derivatives of $L_1(\alpha, \theta)$ equal to zero, the following likelihood equations are obtained (see Hasselblad [17] and Redner [29]).


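(2.7)  $\alpha_i = \frac{1}{N} \sum_{j=1}^{N} \frac{\alpha_i f_i(X_j \mid \theta_i)}{f(X_j \mid \alpha, \theta)}$

(2.8)  $\theta_i = \sum_{j=1}^{N} \frac{f_i(X_j \mid \theta_i)}{f(X_j \mid \alpha, \theta)}\, T_i(X_j) \Big/ \sum_{j=1}^{N} \frac{f_i(X_j \mid \theta_i)}{f(X_j \mid \alpha, \theta)}$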
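Since the likelihood equations for the mixing proportions and component parameters are implicit, a natural iterative scheme is to substitute the current values into the right-hand sides and repeat. The sketch below does this for a deliberately simple case that is assumed here for illustration: two univariate normal components with known unit variance, so that $T_i(x) = x$ and $\theta_i$ is the component mean. The function names, starting values, and simulated data are hypothetical, and nothing here guarantees convergence to a consistent root in general.

```python
import math
import random

def normal_pdf(x, mean):
    """Unit-variance normal density (assumed component family)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def likelihood_equation_step(data, alpha, theta):
    """One substitution step: right-hand sides of the two likelihood
    equations evaluated at the current (alpha, theta)."""
    m, n = len(alpha), len(data)
    weight_sum = [0.0] * m     # sum_j alpha_i f_i(X_j) / f(X_j)
    stat_sum = [0.0] * m       # sum_j [alpha_i f_i(X_j) / f(X_j)] T_i(X_j)
    for x in data:
        parts = [alpha[i] * normal_pdf(x, theta[i]) for i in range(m)]
        total = sum(parts)     # mixture density f(X_j | alpha, theta)
        for i in range(m):
            w = parts[i] / total
            weight_sum[i] += w
            stat_sum[i] += w * x
    new_alpha = [w / n for w in weight_sum]
    new_theta = [stat_sum[i] / weight_sum[i] for i in range(m)]
    return new_alpha, new_theta

rng = random.Random(2)
data = [rng.gauss(-2.0, 1.0) if rng.random() < 0.3 else rng.gauss(2.0, 1.0)
        for _ in range(1000)]

alpha, theta = [0.5, 0.5], [-1.0, 1.0]
for _ in range(100):
    alpha, theta = likelihood_equation_step(data, alpha, theta)
print(alpha, theta)
```

Multiplying numerator and denominator of the second equation by the mixing proportion leaves the ratio unchanged, which is why the same weights serve both updates.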


In addition to the implicitness of the likelihood equations a further difficulty is that the likelihood function may actually be unbounded. For example, if the $f_i(x \mid \theta_i)$ in (2.3) are multivariate normal, one of the means is set equal to a sample value, and the corresponding covariance matrix tends toward singularity, then $L_1$ tends to infinity (Duda and Hart [12]). Redner [29] shows that $L_1$ has a global maximum if a penalty term

$-\lambda \sum_{i=1}^{m} \|\Sigma_i^{-1}\|^{\epsilon}$

is added, where $\lambda, \epsilon > 0$ and $\|\Sigma_i^{-1}\|$ is a norm of the inverse of the $i$th covariance matrix. For partially labelled samples the likelihood function is bounded provided that each multivariate normal population is adequately represented in the labelled portion of the sample.
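The unboundedness is easy to reproduce in one dimension. The sketch below is an illustration with made-up data and hypothetical names: one component mean is pinned to the first observation and its standard deviation is shrunk toward zero, while the second component is held fixed at Normal(1, 1).

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def log_likelihood(data, alpha, means, sds):
    """Unlabelled-sample log likelihood of a univariate normal mixture."""
    return sum(
        math.log(sum(a * normal_pdf(x, m, s) for a, m, s in zip(alpha, means, sds)))
        for x in data
    )

data = [-1.2, 0.3, 0.8, 2.1, 3.5]     # illustrative values only
shrinking_sds = (1.0, 1e-2, 1e-4, 1e-8)
values = [
    log_likelihood(data, [0.5, 0.5], [data[0], 1.0], [sd, 1.0])
    for sd in shrinking_sds
]
print(values)
```

The observation that coincides with the pinned mean contributes a term that grows like the negative logarithm of the shrinking standard deviation, while the other terms stay bounded because the second component still covers them.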


3.  Asymptotic  properties  of  the  mle  when  m is  known. 


Let $X_1, \ldots, X_N$ be independent random variables with densities $f_j(x_j \mid \theta^\circ)$, $\theta^\circ \in \Theta$, an open subset of $\mathbb{R}^n$. When we say that there is a strongly consistent maximum likelihood estimator we mean that given a small enough neighborhood U of the true parameter $\theta^\circ$ the probability is one that there is an integer $N_0$ such that for $N \ge N_0$ there is a unique solution $\theta_N$ of $\partial L_N(\theta)/\partial\theta = 0$ in U and that $\theta_N$ locally maximizes $L_N(\theta)$, where

(3.1)  $L_N(\theta) = \sum_{j=1}^{N} \log f_j(X_j \mid \theta).$

The estimator $\theta_N$ is asymptotically normal and efficient if $C_N^{-1/2}(\theta_N - \theta^\circ)$ converges in distribution to $N(0, I)$, where $C_N$ is the Cramer-Rao lower bound

(3.2)  $C_N^{-1} = -E_{\theta^\circ}\left[ \frac{\partial^2 L_N}{\partial\theta^2} \Big|_{\theta = \theta^\circ} \right]$

Under the regularity conditions to be assumed this is

(3.3)  $C_N^{-1} = \sum_{j=1}^{N} E_{\theta^\circ}\left[ \frac{\partial \log f_j(X_j \mid \theta)}{\partial\theta}\, \frac{\partial \log f_j(X_j \mid \theta)}{\partial\theta}^{T} \Big|_{\theta = \theta^\circ} \right]$

We observe that for all the sample types considered in the previous section, there are only a finite number of distinct densities $f_j(X_j \mid \theta)$ to be considered, whether the data has missing components or not (in sampling by blocks, the block sizes are bounded). In each instance, a straightforward modification of the following theorem and its proof suffices to establish the required asymptotic properties of the mle.

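Theorem 1. Let $\{g_j(y_j \mid \theta) \mid 1 \le j \le p;\ \theta \in \Theta\}$ be a finite set of parametric families of density functions with the same parameter set $\Theta$, an open subset of $\mathbb{R}^n$. Let $X_1, X_2, \ldots$ be independent random variables with densities $f_1(x_1 \mid \theta^\circ), f_2(x_2 \mid \theta^\circ), \ldots$ where each $f_j(x_j \mid \theta^\circ)$ is one of $\{g_j(y_j \mid \theta^\circ)\}_{j=1}^{p}$. Suppose each $g(y \mid \theta) \in \{g_j(y \mid \theta)\}_{j=1}^{p}$ satisfies the condition:

a.  there is a neighborhood U of $\theta^\circ$ such that for all $\theta \in U$ and almost all y,

$\left\| \frac{\partial g}{\partial\theta} \right\| \le h_1(y)$,  $\left\| \frac{\partial^2 g}{\partial\theta^2} \right\| \le h_2(y)$  and  $\left\| \frac{\partial^3 \log g}{\partial\theta^3} \right\| \le h_3(y)$,

where $h_1$ and $h_2$ are integrable and $\int h_3(y)\, g(y \mid \theta^\circ)\, dy < \infty$.

b.  Suppose that there is a positive number $\epsilon$ such that

$\frac{1}{N} C_N^{-1} = \frac{1}{N} \sum_{j=1}^{N} E_{\theta^\circ}\left[ \frac{\partial \log f_j(X_j \mid \theta)}{\partial\theta}\, \frac{\partial \log f_j(X_j \mid \theta)}{\partial\theta}^{T} \Big|_{\theta = \theta^\circ} \right] \ge \epsilon I$  for sufficiently large N.

Then there is a strongly consistent solution $\theta_N$ of the likelihood equation

$0 = \frac{\partial}{\partial\theta} \sum_{j=1}^{N} \log f_j(X_j \mid \theta).$

Furthermore, $\theta_N$ is asymptotically normal, $\theta_N \sim N(\theta^\circ, C_N)$, and efficient.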

Conditions (a) and (b) are very similar to those of Chanda [7], who generalized to the multiparameter case the theorems of Cramer [8] and Huzurbazar [19]. For a proof of Theorem 1 see Foutz [13], Peters and Walker [25], and Peters [23].

Returning to the mixture density likelihood functions of the previous section with component densities $f_i(x \mid \theta_i) = c(\theta_i)\, h_i(x)\, \exp[q_i(\theta_i) \cdot T_i(x)]$, assume that each pattern of missing components in X manifests itself in a certain pattern of missing components in $T_i(X)$ (as in the multivariate normal distribution). For a sample of size k, let $\phi(i, j, k)$ denote the relative frequency with which the $j$th component of $T_i(X)$ is observed.

The next theorem is stated for fully observed data vectors; however, it remains valid for data with missing components provided that for each i and j, $\lim_{k \to \infty} \phi(i, j, k) > 0$ for any sample of size k tending to infinity (see Peters [23] and Redner [29]).

Theorem 2. Suppose the functions $\{\exp[q_i(\theta_i^\circ) \cdot T_i(x)]\}_{i=1}^{m}$ together with the component functions of $\{T_i(x) \exp[q_i(\theta_i^\circ) \cdot T_i(x)]\}_{i=1}^{m}$ are all linearly independent. Then there is a consistent, asymptotically normal and efficient mle of $(\alpha^\circ, \theta^\circ)$ for $L_1(\alpha, \theta)$ as $N \to \infty$, for $L_2(\alpha, \theta)$ as $N \to \infty$ and each $M_i/N$ remains bounded, and for $L_3(\alpha, \theta)$ as $M + N \to \infty$.

4.  Mixtures  with  an  unknown  number  of  classes. 

If the number m of classes in the mixture density is among the parameters to be estimated, then the results of the preceding section no longer apply. It is easy to see that the likelihood function $L_1(\alpha, \theta)$ can be made arbitrarily large if m is taken to be the sample size. For partially identified samples leading to $L_3(\alpha, \theta)$ the number of classes is eventually determined as M (the number of identified samples) tends to $\infty$; however, because of the expense of labelling samples, one would like to be able to include m as a parameter in $(\alpha, \theta)$. An approach which has had some success in applications is the quasi-Bayesian approach used by Rassbach [27], which will not be discussed here, although it has some similarities to the use of the Akaike information criterion proposed by Redner and Coberly [30].
Suppose

(4.1)  $f(x \mid \alpha, m, \psi) = \sum_{i=1}^{m} \alpha_i f_i(x \mid \theta_i)$

is a mixture density family with parameters $\alpha$, m and $\psi = (\theta_1, \ldots, \theta_m)$ satisfying

(4.2)  $1 \le m \le \bar m$

$\sum_{i=1}^{m} \alpha_i = 1$ ;  $\alpha_i \ge 0$

$\theta_i \in \bar\Theta$,

a compact subset of $\mathbb{R}^n$. Since the parameter space is compact we could consider global maxima of the likelihood, except that unfortunately the parameters are no longer identifiable, even locally. This is a consequence of the particular compact parametrization chosen and not of any inherent non-identifiability of finite mixtures (Teicher [31] and Yakowitz [35]). Redner adapted Wald's consistency theorem (Wald [32]) to show that if $\mathcal{F}$ is the set of all finite mixtures (4.1) satisfying (4.2), then under certain conditions, if $X_1, X_2, \ldots, X_N$ are iid from $f^\circ \in \mathcal{F}$, then there is a unique mle $f_N \in \mathcal{F}$ satisfying

$\sum_{j=1}^{N} \log f_N(X_j) = \max_{f \in \mathcal{F}} \sum_{j=1}^{N} \log f(X_j)$

and with probability 1, $f_N(x) \to f^\circ(x)$ for each x as $N \to \infty$, except perhaps for a null set depending only on $f^\circ$ (Redner [28]).

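For estimating m, which is frequently of independent interest, it
is necessary to further restrict the parameters as follows: Replace
conditions (4.2) by

\[ m = \bar m, \qquad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \ge \epsilon_1 > 0 \ \text{or} \ \alpha_i = 0, \]
\[ \theta_i \in \Theta, \ \text{a compact subset of } \mathbb{R}^n, \qquad \delta(\theta_i, \theta_j) \ge \epsilon_2 > 0 \ \text{for } i \ne j, \tag{4.3} \]

where $\delta: \Theta \times \Theta \to [0, \infty)$ is a continuous function such that $f(\cdot \mid \theta) = f(\cdot \mid \theta')$
if and only if $\delta(\theta, \theta') = 0$. A good example of $\delta$ is the
Bhattacharyya coefficient $\delta(\theta, \theta') = 1 - \int [f(x \mid \theta) f(x \mid \theta')]^{1/2}\,dx$. Assume
that $\theta$ is identifiable in $f(x \mid \theta)$.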
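As an illustration of the separation function $\delta$, the following sketch evaluates $1 - \int [f g]^{1/2}\,dx$ numerically for two univariate normal densities. The normal family, the grid and the function names are assumptions made for the example only; they do not come from the text.

```python
import numpy as np

def normal_pdf(mean, sd):
    """Return a univariate normal density as a function of x."""
    def pdf(x):
        return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    return pdf

def bhattacharyya_separation(f, g, grid):
    """Approximate delta = 1 - integral of sqrt(f g) by the trapezoidal rule."""
    y = np.sqrt(f(grid) * g(grid))
    integral = float(np.sum((y[1:] + y[:-1]) * np.diff(grid)) / 2.0)
    return 1.0 - integral

# Hypothetical component densities, chosen only for illustration.
grid = np.linspace(-20.0, 20.0, 40001)
same = bhattacharyya_separation(normal_pdf(0.0, 1.0), normal_pdf(0.0, 1.0), grid)
apart = bhattacharyya_separation(normal_pdf(0.0, 1.0), normal_pdf(2.0, 1.0), grid)
```

Identical components give a separation of essentially zero, while components with different means give a strictly positive value, which is the property required of $\delta$ in (4.3).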

Theorem 3 (Redner [28]). Let $X_1, X_2, \ldots$ be independent samples from
a mixture density $f(x \mid \alpha^0, m, \Phi^0)$ of type (4.1) subject to conditions
(4.3). Let $N_r(\theta)$ be the closed ball of radius $r$ at $\theta$. Suppose
the family $\{f(x \mid \theta)\}_{\theta \in \Theta}$ satisfies the conditions:

(i) $\int \log f^*(x, \theta, r)\, f(x \mid \theta')\,dx < \infty$ for sufficiently small
$r = r(\theta, \theta')$, where $f^*(x, \theta, r) = \max\{1, \sup_{\theta'' \in N_r(\theta)} f(x \mid \theta'')\}$,

(ii) for each $\theta$ there is a null set $S_\theta$ such that for all
$x \notin S_\theta$, $\lim_{\theta' \to \theta} f(x \mid \theta') = f(x \mid \theta)$,

and

(iii) $\int |\log f(x \mid \theta')|\, f(x \mid \theta)\,dx < \infty$ for all $\theta, \theta'$.

Then the global mle $(\alpha_N, m, \Phi_N)$ is a strongly consistent estimator
of $(\alpha^0, m, \Phi^0)$. In particular, with probability one the number of nonzero
components of $\alpha_N$ is eventually the correct number and for the exponential
families $\{f(x \mid \theta)\}$ discussed in Section 3, $(\alpha_N, \Phi_N)$ is asymptotically
normal and efficient.


Wolfe [34] suggested a hypothesis testing approach to determining m,
where the null hypothesis is that the mixture has m components against the
alternative of m + 1 components. Specifically, let $X_1, \ldots, X_N$ be a sample from
$f(x)$ where

\[ H_0:\ f(x) = \sum_{i=1}^{m} \alpha_i f(x \mid \theta_i), \qquad \alpha_i > 0, \quad \sum_{i=1}^{m} \alpha_i = 1, \quad \theta_1, \ldots, \theta_m \ \text{distinct elements of } \Theta, \]

\[ H_1:\ f(x) = \sum_{i=1}^{m+1} \alpha_i f(x \mid \theta_i), \qquad \alpha_i > 0, \quad \sum_{i=1}^{m+1} \alpha_i = 1, \quad \theta_1, \ldots, \theta_{m+1} \ \text{distinct elements of } \Theta, \]

are the null and alternative hypotheses and $\Theta$ is an open subset of $\mathbb{R}^n$.
Let $f_m^N$ and $f_{m+1}^N$ be consistent mle's of $f$ under $H_0$ and $H_1$
respectively. Wolfe bases his test on the assumption that under $H_0$ the
likelihood ratio statistic

\[ \lambda_N = 2 \sum_{j=1}^{N} \log f_{m+1}^N(X_j) - 2 \sum_{j=1}^{N} \log f_m^N(X_j) \]

has an asymptotic $\chi^2$ distribution with d.f. = n + 1 as $N \to \infty$.
Unfortunately, this does not always seem to hold (Hartigan [16]). Hartigan
suggests that $\lambda_N$ is stochastically smaller than $\chi^2_{n+1}$, which would be
useful if true, since then an upper bound at least for the size of the $\chi^2$ test would
be known. Apparently, the m-class model cannot be embedded in the m + 1
class model regularly enough so that the classical asymptotic theory is valid.

Finally, Redner and Coberly [30] have suggested using the Akaike
information criterion to estimate the number of components in the model, whereby
$m$, $\alpha_1, \ldots, \alpha_m \ge 0$ and $\theta_1, \ldots, \theta_m \in \Theta$ are chosen to maximize

\[ \mathrm{AIC}_m = L_m - k_m \tag{4.4} \]

where $L_m$ is the maximum log likelihood for the m-class model

\[ L_m = \max_{f \in F_m} \sum_{j=1}^{N} \log f(X_j) = \sum_{j=1}^{N} \log f_m^N(X_j) \tag{4.5} \]

and $k_m$ is the number of free parameters, namely

\[ k_m = mn + m - 1 . \tag{4.6} \]

If m is the true number of components and $f_m^N$, $f_{m+1}^N$ are consistent
mle's, and if the likelihood ratio statistic has a $\chi^2_{n+1}$ distribution,
then $\mathrm{AIC}_{m+1} - \mathrm{AIC}_m = \tfrac12 \lambda_N - (n+1)$, which has expectation $-(n+1)/2$ asymptotically. The use of the Akaike
criterion then is subject to the same reservations as the use of the $\chi^2$-test,
although there is no question of its utility in providing an adequate and
economical description of a given data distribution.
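The selection rule (4.4)-(4.6) and its relation to the likelihood ratio statistic can be sketched as follows. The maximized log likelihoods used here are invented numbers for illustration, and the function names are hypothetical; in practice $L_m$ would come from fitting each m-class model.

```python
def free_parameters(m, n):
    """k_m = m*n + m - 1 for m components with n-dimensional component parameters."""
    return m * n + m - 1

def aic(max_log_likelihood, m, n):
    """AIC_m = L_m - k_m, in the form (4.4), which is to be maximized."""
    return max_log_likelihood - free_parameters(m, n)

def likelihood_ratio(l_m, l_m_plus_1):
    """lambda_N = 2 (L_{m+1} - L_m)."""
    return 2.0 * (l_m_plus_1 - l_m)

def select_components(max_log_likelihoods, n):
    """Return the m with the largest AIC_m and the table of scores."""
    scores = {m: aic(l, m, n) for m, l in max_log_likelihoods.items()}
    return max(scores, key=scores.get), scores

# Hypothetical maximized log likelihoods L_1, L_2, L_3 for n = 2.
log_liks = {1: -520.0, 2: -480.0, 3: -478.5}
best_m, scores = select_components(log_liks, n=2)
lam_23 = likelihood_ratio(log_liks[2], log_liks[3])
```

With these numbers the third component raises the log likelihood by only 1.5, which does not pay for the n + 1 = 3 extra parameters, so m = 2 is selected.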

Summary


The statistical properties of a cubic smoothing spline
and its derivative are analyzed. It is shown that unless
unnatural boundary conditions hold, the integrated squared bias
is dominated by local effects near the boundary. Similar effects
are shown to occur in the regularized solution of a translation-kernel
integral equation. These results are derived by developing
a Fourier representation for a smoothing spline.

1.  Introduction  and  Summary


We consider statistical properties of smoothing splines and related
procedures. Given $x_i = f(t_i) + \epsilon_i$, $i = 1, \ldots, n$, where $f$ is an unknown
smooth function and the $\epsilon_i$ are random errors, a cubic smoothing spline
$g(t; \lambda)$ is the function which minimizes

\[ \frac{1}{n} \sum_{i=1}^{n} [x_i - g(t_i)]^2 + \lambda \int (g''(t))^2\,dt . \tag{1.1} \]

Smoothing splines were proposed by Whittaker (1923), Schoenberg (1964), and
Reinsch (1967). Some analysis of their statistical properties in the case
that g and f are periodic appears in Wahba (1975) and Rice and Rosenblatt
(1981). The method of cross validation for choosing the smoothing parameter
$\lambda$ from the data has been discussed in Craven and Wahba (1979).

Smoothing splines may be viewed in a larger context. Given $x_i = (Af)(t_i) + \epsilon_i$,
where $A$ is a linear operator, a "regularized" estimate of
$f$ is the function $g$ which minimizes

\[ \frac{1}{n} \sum_{i=1}^{n} [x_i - (Ag)(t_i)]^2 + \lambda \int (g''(t))^2\,dt . \tag{1.2} \]


Frequently $Af$ is of the form

\[ (Af)(t) = \int k(t, s) f(s)\,ds . \tag{1.3} \]


Many  examples  of  this  type  may  be  found  in  Tikhonov  and  Arsenin  (1977). 

The  method  of  regularization  is  used  to  control  the  instability  that  would 
arise  if  one  tried  to  invert  A or  A*A.  The  regularized  solutions  have  a 
formal  resemblance  to  ridge-regression  estimates;  in  both  cases  the  variance 



of  the  estimate  is  reduced  at  the  cost  of  increasing  bias.  Although 
there  is  a large  literature  on  this  topic,  there  has  been  relatively  little 
analysis  of  the  statistical  properties  of  the  solutions. 
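The resemblance to ridge regression can be seen in a discretized version of the penalized criterion. The sketch below is an assumption-laden illustration, not the procedure analyzed in the paper: it takes $A$ to be the identity, replaces $g''$ by second differences on an equally spaced grid on [0, 1], and solves the resulting linear system. The function names and the noise level are hypothetical.

```python
import numpy as np

def second_difference_matrix(n):
    """(n-2) x n matrix whose rows compute g[i] - 2 g[i+1] + g[i+2]."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = (1.0, -2.0, 1.0)
    return D

def regularized_fit(x, lam):
    """Minimize (1/n)||x - g||^2 + lam * h * sum((D g / h^2)^2) over vectors g."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = 1.0 / (n - 1)                       # grid spacing on [0, 1]
    D = second_difference_matrix(n)
    penalty = (D.T @ D) / h ** 3            # discretized integral of (g'')^2
    # Normal equations: (I/n + lam * penalty) g = x / n, a ridge-type system.
    return np.linalg.solve(np.eye(n) / n + lam * penalty, x / n)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
truth = np.cos(2 * np.pi * t) + 4 * np.cos(np.pi * t)
noisy = truth + 0.3 * rng.normal(size=t.size)   # hypothetical noise level
smooth = regularized_fit(noisy, lam=1e-4)
```

As in ridge regression, increasing `lam` shrinks the solution toward the null space of the penalty (here straight lines), lowering variance at the cost of bias.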

In this paper we examine two cases of (1.3), numerical differentiation

\[ (Af)(t) = \int_0^t f(u)\,du \tag{1.4} \]

and deconvolution,

\[ (Af)(t) = \int_0^1 w(t - s) f(s)\,ds . \tag{1.5} \]

We  next  summarize  and  discuss  our  main  results.  Derivations  and  some 
further  results  are  contained  in  later  sections.  We  first  deal  with  a 
cubic  smoothing  spline. 

Consider observations

\[ x_k = f(k/n) + \epsilon_k, \qquad k = 0, 1, \ldots, n \]

with $f$ continuously differentiable, $f'' \in L^2$ and the $\epsilon_k$ random variables
with

\[ E\,\epsilon_k = 0, \qquad E\,\epsilon_k \epsilon_j = \delta_{k,j}\,\sigma^2, \quad \sigma^2 > 0 . \]

We wish to determine a continuously differentiable function $g = g(t; \lambda, n)$
with $g'' \in L^2$ that minimizes

\[ \frac{1}{n} \Bigl\{ \bigl( \tfrac12 (x_0 + x_n) - \tfrac12 (g(0) + g(1)) \bigr)^2 + \sum_{k=1}^{n-1} (x_k - g(k/n))^2 \Bigr\} + \lambda \int_0^1 (g''(t))^2\,dt . \tag{1.6} \]

Here $\lambda = \lambda(n) > 0$ and the object is to determine $\lambda(n)$ as a function of
$n$ so that

\[ \int_0^1 E[g(t) - f(t)]^2\,dt \]

tends to zero as $n \to \infty$ at a rapid rate. The term

\[ \tfrac12 (x_0 + x_n) - \tfrac12 (g(0) + g(1)) \]

appears in (1.6) because one wishes to allow for the possibility that
$f(0) \ne f(1)$ and in that case the Fourier series of $f(t)$ will converge to
$\tfrac12 (f(0) + f(1))$ at $t = 0, 1$.

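Theorem 1. Let $f \in C^2$. If $\lambda^3(n) n^8 \to \infty$, $\lambda(n) \to 0$ as $n \to \infty$ then

\[ \int_0^1 \sigma^2[g(t)]\,dt \cong \frac{\sigma^2}{n}\, \lambda^{-1/4}\, 3 \cdot 2^{-7/2} . \]

Theorem 2. Let $f \in C^4$. Assume that $\lambda^3(n) n^8 \to \infty$, $\lambda(n) \to 0$ as $n \to \infty$.
Then if $f^{(2)}(0)$ or $f^{(2)}(1) \ne 0$

\[ \int_0^1 [E\,g(t) - f(t)]^2\,dt \cong \{(f^{(2)}(0))^2 + (f^{(2)}(1))^2\}\, \lambda^{5/4}\, 2^{-3/2} \]

while if $f^{(2)}(0) = f^{(2)}(1) = 0$ but $f^{(3)}(0) \ne 0$ or $f^{(3)}(1) \ne 0$ we have

\[ \int_0^1 [E\,g(t) - f(t)]^2\,dt \cong \{(f^{(3)}(0))^2 + (f^{(3)}(1))^2\}\, \lambda^{7/4}\, 3 \cdot 2^{-3/2} . \]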


A common reason for nonparametric data smoothing is to calculate an
estimate of the derivative of a function. Schemes for numerically
differentiating noisy data that are closely related to the derivative of a
smoothing spline have been proposed in Cullum (1971) and Anderssen and Bloomfield
(1974). The properties of the derivative of a smoothing spline follow
fairly directly from the properties of the smoothing spline itself.

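Theorem 3. If $f \in C^2$ and if $\lambda n \to \infty$ as $n \to \infty$ and
$\lambda \to 0$, then

\[ \int_0^1 \sigma^2(g'(t))\,dt = \frac{\sigma^2}{n}\, \lambda^{-3/4} \cdot 2^{-7/2} + o(n^{-1} \lambda^{-3/4}) . \]

Theorem 4. Assume that $f \in C^4$ and that $\lambda n^3 \to \infty$. Then if
$f^{(2)}(0) \ne 0$ or $f^{(2)}(1) \ne 0$

\[ \int_0^1 [E\,g'(t) - f'(t)]^2\,dt \cong [(f^{(2)}(0))^2 + (f^{(2)}(1))^2] \cdot \lambda^{3/4} \cdot 3 \cdot 2^{-3/2} . \]

If $f^{(2)}(0) = f^{(2)}(1) = 0$, but $f^{(3)}(0)$ or $f^{(3)}(1) \ne 0$ then

\[ \int_0^1 [E\,g'(t) - f'(t)]^2\,dt \cong [(f^{(3)}(0))^2 + (f^{(3)}(1))^2]\, \lambda^{5/4} \cdot 3 \cdot 2^{-3/2} . \]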


Comparing these results to Theorems 1 and 2 we see that the variance and
integrated squared bias of the derivative are a factor of $\lambda^{-1/2}$ larger than
the variance and integrated squared bias of the function itself.
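The numerical constants $3 \cdot 2^{-7/2}$ and $2^{-7/2}$ in the variance statements can be checked directly, on the assumption that they arise as $\int dx / ((2\pi x)^4 + 1)^2$ and $\int (2\pi x)^2\,dx / ((2\pi x)^4 + 1)^2$, the integrals appearing in the variance calculations of Section 3. The sketch substitutes $u = 2\pi x$ and uses a plain trapezoidal rule; the grid limits are arbitrary choices.

```python
import numpy as np

def integrate(y, x):
    """Trapezoidal rule on a grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# After u = 2*pi*x each integral picks up a factor 1/(2*pi).
u = np.linspace(-200.0, 200.0, 400001)
spline_const = integrate(1.0 / (1.0 + u ** 4) ** 2, u) / (2.0 * np.pi)
deriv_const = integrate(u ** 2 / (1.0 + u ** 4) ** 2, u) / (2.0 * np.pi)

# Ratio of derivative variance to function variance, apart from lambda^(-1/2).
ratio = deriv_const / spline_const
```

The two values agree with $3 \cdot 2^{-7/2} \approx 0.2652$ and $2^{-7/2} \approx 0.0884$, so the variance of the derivative is about $\lambda^{-1/2}/3$ times that of the function.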
Theorem 2 shows that the integrated squared bias is dominated by
contributions from the boundary unless $f$ satisfies the condition
$f^{(k)}(0) = f^{(k)}(1) = 0$, $k = 2, 3$. Lemma 6 of section 3 gives a local approximation
to the bias in the case that these conditions are not met. Roughly, the
bias decays like $\exp(-2^{-1/2} \lambda^{-1/4} t)$ trigonometrically modulated. In the
interior of [0,1] the squared bias is proportional to $\lambda^2$.

These results are not unexpected. The smoothing spline is a "natural"
spline and satisfies the two arbitrary end conditions $g''(0) = g''(1) = 0$.
In the context of pure interpolation the use of a natural spline is usually
not recommended since the error near the ends is of order $h^2$ where $h$
is the mesh size whereas other methods can produce an error uniformly of
order $h^4$ if $f \in C^4$, de Boor (1978), Powell (1981). Similarly, it can
be shown that the boundary effect dominates the integrated squared error,
Rosenblatt (1976). In the nonstochastic framework methods of estimating
the boundary constraints have been proposed in these references and it
would appear plausible that a similar approach might work in the stochastic
case.

Natural splines in the nonstochastic setting and smoothing splines in
the stochastic setting are the optimal solutions of certain minimax problems,
Powell (1981) and Speckman (1981). It appears that flexibility is lost by
guarding against worst cases.

Smoothing splines have also been proposed in the case of spectral
density estimation (see Cogburn and Davis (1974) and Wahba (1980)). Boundary
effects similar to those studied here occur in the case of periodic smoothing
splines unless the function is smoothly periodic (see Rice and Rosenblatt (1980)).
The aliasing in the case of spectral analysis of discretely sampled data
implies that boundary behavior will not be smooth in this context.

In the deconvolution problem we consider observations

\[ x_k = F(k/n) + \epsilon_k, \qquad k = 0, \ldots, n \]

where $F(k/n) = \int_0^1 w(k/n - u) f(u)\,du$, with $f'' \in L^2$ and the $\epsilon_k$ uncorrelated
random variables with mean 0 and variance $\sigma^2$. The regularized
approximation to $f$ is the function $g$ that minimizes

\[ \frac{1}{n} \Bigl\{ \bigl( \tfrac12 (x_0 + x_n) - \tfrac12 (G(0) + G(1)) \bigr)^2 + \sum_{k=1}^{n-1} (x_k - G(k/n))^2 \Bigr\} + \lambda \int_0^1 (g''(t))^2\,dt . \tag{1.7} \]

Here $G(k/n) = \int_0^1 w(k/n - u) g(u)\,du$. The kernel of the integral equation, $w$,
is the periodic extension of a function defined on [0,1], and it is
assumed that $w \in L^2$. We assume that the Fourier coefficients $w_k$ of $w$
are nonzero for all $k$.

The constants that occur in the asymptotic expressions for the
components of the integrated mean square error depend on the exact form of $w$,
but the rates of decrease depend only on the rate of decrease of the Fourier
coefficients $w_k$ of $w$. Paralleling Theorems 1 and 2 we have

Theorem 5. Let $f \in C^2$ and suppose that $|w_k|^2 \sim k^{-2\beta}$, $\beta > 0$. If $\lambda n^{2\beta+3} \to \infty$
as $n \to \infty$, then

\[ \int_0^1 \sigma^2[g(t)]\,dt \sim n^{-1} \lambda^{-(2\beta+1)/(2\beta+4)} . \]

Theorem 6. Let $f \in C^4$ and suppose that $|w_k|^2 \sim k^{-2\beta}$, $\beta > 0$ and $\lambda n^{2\beta+3} \to \infty$
as $n \to \infty$. Then if $f^{(2)}(0)$ or $f^{(2)}(1) \ne 0$

\[ \int_0^1 [E\,g(t) - f(t)]^2\,dt \sim \lambda^{5/(2\beta+4)} . \]

If $f^{(2)}(0) = f^{(2)}(1) = 0$ but $f^{(3)}(0)$ or $f^{(3)}(1) \ne 0$, then

\[ \int_0^1 [E\,g(t) - f(t)]^2\,dt \sim \lambda^{7/(2\beta+4)} . \]

If $f^{(k)}(0) = f^{(k)}(1) = 0$, $k = 2, 3$ then

\[ \int_0^1 [E\,g(t) - f(t)]^2\,dt \sim \lambda^{8/(2\beta+4)} . \]

Analytic expressions for the approximate local bias are not available,
but the qualitative behavior is similar to that of a smoothing spline.

Note that if $w$ is very smooth, $\beta$ is large, and the integrated
mean square error will tend to zero relatively slowly.
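The remark about slow convergence for smooth $w$ can be made concrete by balancing the variance rate $n^{-1} \lambda^{-(2\beta+1)/(2\beta+4)}$ against a squared bias rate $\lambda^{p/(2\beta+4)}$ with $p = 5, 7$ or $8$. The sketch below does this arithmetic exactly; the balancing argument and the function name are assumptions added for illustration and ignore all constants.

```python
from fractions import Fraction

def balanced_rates(beta, bias_power):
    """Balance n^-1 * lam^-v against lam^b, with v = (2*beta+1)/(2*beta+4)
    and b = bias_power/(2*beta+4). Returns (e, r) such that
    lam ~ n^-e and the integrated mean square error ~ n^-r."""
    beta = Fraction(beta)
    denom = 2 * beta + 4
    var_exp = (2 * beta + 1) / denom
    bias_exp = Fraction(bias_power) / denom
    lam_exp = 1 / (var_exp + bias_exp)
    mse_exp = bias_exp * lam_exp
    return lam_exp, mse_exp

# beta = 0 formally reproduces the smoothing spline rates of Theorems 1 and 2.
spline_like = balanced_rates(0, 5)
smoother_kernel = balanced_rates(2, 5)
```

For $\beta = 0$ and $p = 5$ this gives $\lambda \sim n^{-2/3}$ and an error of order $n^{-5/6}$; for $\beta = 2$ the error rate drops to $n^{-1/2}$, in line with the note above.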


2.  Examples 


The function $f(t) = \cos(2\pi t) + 4 \cos(\pi t)$ satisfies $f''(0) = -8\pi^2$,
$f''(1) = 0$, $f'''(0) = f'''(1) = 0$. Figures 1 and 2 show the exact bias of
the smoothing spline estimate of the function and its derivative for 50
equi-spaced sampling points and $\lambda = 10^{-6}$. The effect of $f''(0)$ is clearly
evident. The asymptotic analysis (Lemma 6) predicts that the bias,

\[ b(t) \propto \exp(-t\,2^{-1/2} \lambda^{-1/4}) \bigl[ \sin(t\,2^{-1/2} \lambda^{-1/4}) - \cos(t\,2^{-1/2} \lambda^{-1/4}) \bigr] . \]

From this expression we see that the first zero-crossing of the bias should
occur at $t = \pi \lambda^{1/4} 2^{-3/2} = .035$ and that $b'(t)$ should be zero at
$t = \pi \lambda^{1/4} 2^{-1/2} = .070$, which is borne out in Figure 1. Figure 2 shows that
the bias of the derivative is larger by a factor of about $\lambda^{-1/4}$.
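The two landmark values quoted for Figure 1 follow from the form of $b(t)$ alone. In the sketch below the value $\lambda = 10^{-6}$ is an assumption, chosen because it reproduces the quoted .035 and .070; the function names are hypothetical.

```python
import math

def bias_shape(t, lam):
    """exp(-u) (sin u - cos u) with u = t 2^(-1/2) lam^(-1/4)."""
    u = t / (math.sqrt(2.0) * lam ** 0.25)
    return math.exp(-u) * (math.sin(u) - math.cos(u))

def first_zero_crossing(lam):
    """sin u = cos u first at u = pi/4, i.e. t = pi lam^(1/4) 2^(-3/2)."""
    return math.pi * lam ** 0.25 * 2.0 ** -1.5

def first_stationary_point(lam):
    """b'(t) is proportional to exp(-u) cos u, zero first at u = pi/2."""
    return math.pi * lam ** 0.25 * 2.0 ** -0.5

lam = 1e-6  # assumed value of the smoothing parameter
t_zero = first_zero_crossing(lam)
t_flat = first_stationary_point(lam)
```

The stationary point lies at exactly twice the first zero-crossing, whatever the value of $\lambda$.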

We next consider the deconvolution problem wherein $f$ is convolved
with a function $w$, the graph of which is an isosceles triangle centered
at 0 with height 20 and base .4. This is intended to correspond to a
situation in which averaged values of $f$ are measured with error. Since
the analysis of section 4 requires that $w$ be periodically extended, the
triangle is also centered over -1 and 1. To calculate the bias, (1.7)
was discretized assuming 25 equi-spaced observations and the solution was
computed at 50 equi-spaced points. Other mesh sizes were tried to ensure
that the results did not merely reflect the discretization. The calculations
were done on a VAX 11/80 in double precision. Figure 3 shows the bias for
$\lambda = 10^{-8}$; there is a clear effect near 0 and also an effect near 1. The
shapes are qualitatively similar to Figure 1.

Since the assumption that $w$ is periodically extended is clearly
somewhat artificial, we also computed the bias for $w$ just corresponding
to a triangle centered over 0. The resulting bias is shown in Figure 4.
Here the only effect is near 0; the effect near 1 of Figure 3 is apparently
due to the periodicity of $w$.
3.  The  Smoothing  Spline  and  Its  Derivative

In this section we derive Theorems 1-4 and some auxiliary results.
In order to do this we carry out a Fourier analysis of the smoothing spline.
Notice that

\[ g_k = \int_0^1 e^{2\pi i k t} g(t)\,dt = \Delta g^0 a_k - \Delta g^1 b_k + h_k b_k \tag{3.1} \]

for $k \ne 0$ where $a_k = (2\pi i k)^{-1}$, $b_k = (2\pi i k)^{-2}$,

\[ \Delta g^0 = g(1) - g(0), \qquad \Delta g^1 = g'(1) - g'(0), \qquad h_k = \int_0^1 e^{2\pi i k t} g''(t)\,dt . \]
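The representation (3.1) is two integrations by parts. A short derivation, assuming $a_k = (2\pi i k)^{-1}$ and $b_k = a_k^2$ (explicit forms inferred from the identity itself) and using $e^{2\pi i k} = 1$:

\[ \int_0^1 e^{2\pi i k t} g(t)\,dt = \Bigl[ \frac{e^{2\pi i k t}}{2\pi i k}\, g(t) \Bigr]_0^1 - \frac{1}{2\pi i k} \int_0^1 e^{2\pi i k t} g'(t)\,dt = a_k\, \Delta g^0 - a_k \int_0^1 e^{2\pi i k t} g'(t)\,dt , \]

\[ \int_0^1 e^{2\pi i k t} g'(t)\,dt = a_k\, \Delta g^1 - a_k h_k , \]

so that

\[ g_k = \Delta g^0 a_k - \Delta g^1 a_k^2 + h_k a_k^2 = \Delta g^0 a_k - \Delta g^1 b_k + h_k b_k . \]

The same computation applied to $g'$ gives the Fourier coefficients $\Delta g^1 a_k - a_k h_k$ of the derivative for $k \ne 0$.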


Let 


and  set 


12  (x0+Xn)  if  j ” 0 


if  j » 1, . . . ,n-l 


n-1 


7S 

/n  j»0  J 


exp  (27tijk/n)  . 


Given a sequence of coefficients $p_k$ we will let $p_k^{(n)}$ denote the
corresponding set of aliased coefficients arising in a discrete Fourier analysis,

\[ p_k^{(n)} = \sum_{s} p_{k+sn}, \qquad k = 0, 1, \ldots, n-1 . \]

Also let


= (n)  . Jn) 


p0 


\[ z_k = \frac{y_k}{\sqrt n} - \Delta g^0 a_k^{(n)} + \Delta g^1 b_k^{(n)}, \qquad k = 1, \ldots, n-1 . \tag{3.2} \]


Lemma 1. Let $f$ and $\Delta g^0$, $\Delta g^1$ be given. Assume that $f, g$ are continuously
differentiable with $f'', g'' \in L^2$. Then the function $g$ minimizing (1.1) is
determined by the following specification on Fourier coefficients:

\[ g_0 = \frac{y_0}{\sqrt n} - \Delta g^0 a_0^{(n)} + \Delta g^1 b_0^{(n)} , \tag{3.3} \]

\[ h_{sn} = 0 \quad \text{for } s \ne 0 , \tag{3.4} \]

\[ h_{k+sn} = \frac{1}{\lambda + r_k}\, \bar b_{k+sn}\, z_k \tag{3.5} \]

for $k = 1, \ldots, n-1$ and integral $s$. Here it is understood that

\[ r_k = \sum_{s} (2\pi (k+sn))^{-4} . \]


The Parseval relation implies that (1.1) can be rewritten as

\[ \Bigl| \frac{y_0}{\sqrt n} - g_0 - g_0^{(n)} \Bigr|^2 + \sum_{k=1}^{n-1} \Bigl| \frac{y_k}{\sqrt n} - \Delta g^0 a_k^{(n)} + \Delta g^1 b_k^{(n)} - (h_k b_k)^{(n)} \Bigr|^2 + \lambda \sum_{k} |h_k|^2 . \tag{3.6} \]

In minimizing this expression, one can separately minimize the sum of the
terms with $k$ fixed for each value of $k$. Minimizing for $k = 0$ leads one
to (3.3) and (3.4). For $k \ne 0$ we have

\[ \lambda h_{k+sn} = \bigl( z_k - (h_k b_k)^{(n)} \bigr)\, \bar b_{k+sn} . \tag{3.7} \]

Multiplying by $b_{k+sn}$ and summing over $s$ leads to

\[ (h_k b_k)^{(n)} = \frac{r_k z_k}{\lambda + r_k} \tag{3.8} \]

and this together with (3.7) leads to (3.5).
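The per-frequency minimization behind (3.5) can be checked numerically. The sketch assumes that, for fixed $k$, the quantity minimized is $|z - \sum_s h_s b_s|^2 + \lambda \sum_s |h_s|^2$, and it truncates the sum over $s$ to seven terms; the values of `b`, `z` and `lam` are arbitrary stand-ins, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.normal(size=7) + 1j * rng.normal(size=7)   # stand-ins for b_{k+sn}
z = 0.8 - 0.3j                                     # stand-in for z_k
lam = 0.05

def objective(h):
    """|z - sum_s h_s b_s|^2 + lam * sum_s |h_s|^2 for one fixed k."""
    return abs(z - np.sum(h * b)) ** 2 + lam * np.sum(np.abs(h) ** 2)

r = np.sum(np.abs(b) ** 2)                         # plays the role of r_k
h_star = np.conj(b) * z / (lam + r)                # candidate minimizer, as in (3.5)

# Compare with randomly perturbed coefficient vectors.
perturbed = [
    objective(h_star + 1e-3 * (rng.normal(size=7) + 1j * rng.normal(size=7)))
    for _ in range(100)
]
minimum_value = objective(h_star)
```

The minimum equals $\lambda |z|^2 / (\lambda + r)$, which is the form of the summands that remain once the $h$'s have been eliminated.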

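Lemma 2. Insert (3.3), (3.4) and (3.5) in (3.6). Minimizing the resulting
expression with respect to $\Delta g^0$, $\Delta g^1$ leads to

\[ \Delta g^0 = \Bigl\{ \sum_{k=1}^{n-1} \frac{y_k}{\sqrt n}\, \frac{\bar a_k^{(n)}}{\lambda + r_k} \Bigr\} \Bigl\{ \sum_{k=1}^{n-1} \frac{|a_k^{(n)}|^2}{\lambda + r_k} \Bigr\}^{-1} \tag{3.9} \]

and

\[ \Delta g^1 = - \Bigl\{ \sum_{k=1}^{n-1} \frac{y_k}{\sqrt n}\, \frac{\bar b_k^{(n)}}{\lambda + r_k} \Bigr\} \Bigl\{ 1 + \sum_{k=1}^{n-1} \frac{|b_k^{(n)}|^2}{\lambda + r_k} \Bigr\}^{-1} . \tag{3.10} \]

If we insert (3.3), (3.4) and (3.5) in the expression (3.6), the result
can be written as

\[ \lambda (\Delta g^1)^2 + \lambda \sum_{k=1}^{n-1} \frac{|z_k|^2}{\lambda + r_k} . \tag{3.11} \]

Minimizing this expression with respect to $\Delta g^0$ and $\Delta g^1$ leads to the
following equations

\[ \sum_{k=1}^{n-1} \frac{z_k\, \bar a_k^{(n)}}{\lambda + r_k} = 0 , \tag{3.12} \]

\[ \Delta g^1 + \sum_{k=1}^{n-1} \frac{z_k\, \bar b_k^{(n)}}{\lambda + r_k} = 0 . \tag{3.13} \]

On solving for $\Delta g^0$, $\Delta g^1$ the expressions (3.9) and (3.10) are obtained.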


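Lemma 3. The function $g$ minimizing (1.1) has Fourier coefficients

\[ g_0 = \frac{y_0}{\sqrt n} - \Delta g^0 a_0^{(n)} + \Delta g^1 b_0^{(n)} , \tag{3.14} \]

\[ g_{sn} = \Delta g^0 a_{sn} - \Delta g^1 b_{sn} \quad \text{for } s \ne 0 , \tag{3.15} \]

and for $k = 1, \ldots, n-1$ and $s$ integral

\[ g_{k+sn} = \Delta g^0 \Bigl\{ a_{k+sn} - \frac{|b_{k+sn}|^2}{\lambda + r_k}\, a_k^{(n)} \Bigr\} - \Delta g^1 \Bigl\{ b_{k+sn} - \frac{|b_{k+sn}|^2}{\lambda + r_k}\, b_k^{(n)} \Bigr\} + \frac{|b_{k+sn}|^2}{\lambda + r_k}\, \frac{y_k}{\sqrt n} . \tag{3.16} \]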


Since 
(3.20)


(3.21) 


n-1 

E i4n) 

k=l  K 

I2  / (X+rk)  “ !2,,k!2/  <xl27tkl 

W# 

^ X_3/4  Cx  , 

E i4n)I2/  <X+rk)2“X"7/4  C2  • 

E i»kn,i2/  (Wrk>  si'1/4c3  • 

E lbk”,|2  / <Wtk)2  ^ X"5/4  C4  . 

k-1  ' 

(3.22) 

(3.23) 

(3.24) 

(3.25) 


where 


/ 

/ 


l2™i2- 

| 2ixx  j 4+l 
dx 

j 2itx|4+1 


dx  , 


/ 

/ 


I 2nx 1 6 
( | 2nx|Vl)'1' 

U™!4 

(|2*x|4+l)2 


dx 


* 


dx 




if  A(n)n 


as  n ®,  it  can  be  seen  that 


o2(&g°)  S ~ C2C“V1/4  , 
o2(Ag1)  S ^ C4C32X“3/A  , 


(3.26) 


(3.27) 


if  A(n)n  -*•  “ as  n The  term 


n-1 

EE 

a k«I 


T-*.  ak+sn 
k«i 


L_  Ik  |2a(n).2 

A+r.  ' k+sn  ’ k ■ 

k 


(3.28) 


2 0 

occurs  as  a coefficient  of  o (Ag  ) in  contributing  to  (3.18).  However, 
(3,28)  can  be  approximated  by 


£-±2nikl°-  1/4 

k (A|2*kl4+ir 


(3.29) 


1 4 

with  an  error  0 — if  A(n)n  -v  <»  as  n -*■  “.  The  term 
n 


£ E IV.„-TOrlV,nl2^n)l2 

s k"l  k 


(3.30) 


2 1 

arises  as  a coefficient  of  o (Ag  ) in  contributing  to  (3.18).  An  estimation 
procedure  similar  to  that  used  in  arriving  at  (3.39)  shows  that  (3.30)  can  be 
approximated  by 


£ a c 

k (A|2»k|A+l)2  4 


(3.31) 




1 1 3 8 

with  an  error  0 —A  if  X (n)n  -*•  * as  n -*■  The  estimates  obtained 
■n  ' 

for  (3.28)  and  (3.30)  imply  that  the  contribution  to  (3.18)  from  the 
terms  involving  Ag®  and  Ag*  in  (3.16)  is 


3 8 

if  X (n)n  ■+  “ as  n “.  Now  consider  the  contribution  from  the  last 
term  on  the  right  of  (3.16).  We  shall  see  that  it  makes  che  major  contri- 
bution to  the  integrated  variance.  The  expression 


y A IWft  i 


(3.32) 


can  be  approximated  by 


2 

0<|k|<| 


1 1 
( 1 2nk  1 4+l)  2 n 


(3.33) 


where 


/ 


dx 


( | 2nx]  +1)‘ 


ft 


with  an  error  0|r-|  if  X(n)n  -*■  ” as  n -*• 
from  these  estimates. 


Theorem  1 follows 


Our next object is to derive Theorem 2 for the integrated squared bias
of $g$ as an estimate of $f$. Notice that for $k \ne 0$ we have

\[ f_k = \int_0^1 e^{2\pi i k t} f(t)\,dt = \Delta f^0 a_k - \Delta f^1 b_k + m_k b_k \tag{3.34} \]

with

\[ m_k = \int_0^1 e^{2\pi i k t} f''(t)\,dt . \tag{3.35} \]


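Using (3.34) it is clear that

\[ \tfrac12 (f(0) + f(1)) = \sum_{k=-\infty}^{\infty} f_k = \sum_{k=0}^{n-1} f_k^{(n)} , \]

\[ f(j/n) = \sum_{k=-\infty}^{\infty} f_k \exp(-2\pi i j k/n) = \sum_{k=0}^{n-1} f_k^{(n)} \exp(-2\pi i j k/n), \qquad j = 1, \ldots, n-1 . \tag{3.36} \]

This implies that

\[ E\, y_j / \sqrt n = f_j^{(n)}, \qquad j = 0, 1, \ldots, n-1 . \tag{3.37} \]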


From  (3.34)  it  follows  that 


+ j-x 


Relations  (3.9),  (3.10),  and  (3.37)  imply  that 


- *«•  i „ 


i kt* 


jO-mO+JE  <Vk)  <"),<»>/  <X«k)j 


(n-1 

E i^’i2/ 

(k-l  K 


(X+rk) 


E Ag1  » Af1 

1 " ‘ 

ji  + 2 i^n)i2/ 

1 <*«k> 

l-ll 

! 

. 

[ k-1  K ' 

1 

\ 

( yi 


•_  v s(n).  (u) 

“kV  bk 


/ (X+rk>( 


i + L ibkn)i2/  ^+rk} 

( k-1  * ' K 


Since  we  are  dealing  with  real-valued  functions  f it  follows  that 

°k  " "-k 


(mkV(n)  “ <m-kb-k)(n) 


These  last  two  relations  together  with  (3.38)  and  (3.39)  imply  that 


E Ag°-Af°  S 


_ t £ (2^)I°  \)  £ (2Ttk) 2 l'1 

| k*>-°°  1+X  | 2nk  | ' j k“-°»  1+X  1 2nk  | ^ | 




and 




e -It  I i -^r 

(k*-®  1+X|2rk|  ) (k«-°*  1+X|2nk|  j 


If  f 6 C one  can  see  thac 


r 


Re  * I f^Xx)  cos  2itkx  dx 


f* 


{f  ■*Xx)+f~(-x)}  cos  2irkx  dx 


(3.41) 


(3.42) 


II 


n 


; : 

(I 


and 


f 


2nk  Im  **  2irk  / f~(x)  sin  2irkx  dx 


= - Af2  + I f(3)(x)  cos  2nkx  dx 


- Af 


r 

•r» 


(f(3)(x)  + f(3)(-x)}  cos  2nkx  dx 


(3.43) 


i : 


with 


Af2  - f(2)(l)  - f(2)(0)  . 


From (3.16) it follows that for $k = 1, \ldots, n-1$

\[ E\,g_{k+sn} - f_{k+sn} = (E\,\Delta g^0 - \Delta f^0) \Bigl\{ a_{k+sn} - \frac{|b_{k+sn}|^2}{\lambda + r_k}\, a_k^{(n)} \Bigr\} - (E\,\Delta g^1 - \Delta f^1) \Bigl\{ b_{k+sn} - \frac{|b_{k+sn}|^2}{\lambda + r_k}\, b_k^{(n)} \Bigr\} \]
\[ \qquad + \Bigl\{ (m_k b_k)^{(n)}\, \frac{|b_{k+sn}|^2}{\lambda + r_k} - m_{k+sn} b_{k+sn} \Bigr\} . \tag{3.44} \]
I. 


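Further, if $f \in C^4$ we have

\[ m_k = \Delta f^2 a_k - \Delta f^3 b_k + f_k^{(4)} b_k \tag{3.45} \]

with

\[ \Delta f^2 = f^{(2)}(1) - f^{(2)}(0), \qquad \Delta f^3 = f^{(3)}(1) - f^{(3)}(0), \qquad f_k^{(4)} = \int_0^1 \exp(2\pi i k t)\, f^{(4)}(t)\,dt . \]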

The last term on the right of (3.44) can then be rewritten as

\[ \Delta f^2 \Bigl\{ (a_k b_k)^{(n)}\, \frac{|b_{k+sn}|^2}{\lambda + r_k} - a_{k+sn} b_{k+sn} \Bigr\} - \Delta f^3 \Bigl\{ (b_k^2)^{(n)}\, \frac{|b_{k+sn}|^2}{\lambda + r_k} - b_{k+sn}^2 \Bigr\} \]
\[ \qquad + \Bigl\{ (f_k^{(4)} b_k^2)^{(n)}\, \frac{|b_{k+sn}|^2}{\lambda + r_k} - f_{k+sn}^{(4)} b_{k+sn}^2 \Bigr\} . \tag{3.46} \]

Let


AQ(t) 


A^t) 


A2(t) 


A3(t) 


n-1 

y 

l/h2x(n) 

|°V 

lbk+sJ 

b2  1 . 
- *w  "p  (- 

■2ni(k+sn)t)  , 

Lj 

k“l 

x+rk 

n-1 

E 

k"l 

1/  , x(n)  ^bk+sJ 
|(akV  X+rk 

b I 

ak+sn  k+snj 

| exp  (-2ni(k+sn)t) 

n-1 

E 

k=l 

l„  . 

| Dk+sn 

lbk+sJ 

x+rk 

b5n  j exp  (~2xi(k+sn)t)  , 

I 

n-1 

E 

k-l 

1,  . 

jHc+sn 

lbk+sJ 

X+r, 

k 

exp  (-2.i(k+sn)t)  . 

1 

Set 


B.(t)  - X S -——^7 — exp  (-2nikt) , j=0,l,2,3  . 


^ k^O  X(2Ttk)‘,+l 

3 8 


Lemma  4.  If  X n X -*■  0 as  n-*-~  then 

I. 


/ 2rii\ 

| Aj (t)-Bj ( t) I 2 dt  - o\X  4 /,  j = 0, 1,2, 3 . 

2=11 

4 


(3.47) 


Also  /q  | Bj  (t)  | ^ dt  tends  to  zero  at  the  rate  of  X , j»0,l,2,3 


The  estimates  required  for  this  lemma  parallel  those  used  to  obtain 
(3.29)  and  (3.31). 

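We wish to get more convenient representations or estimates of the
$B_j(t)$'s. A contour integration shows that

\[ C_0(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{itx}}{1 + x^4}\,dx = \frac{1}{2\sqrt2}\, e^{-|t| 2^{-1/2}} \bigl( \cos(t\,2^{-1/2}) + \sin(|t|\,2^{-1/2}) \bigr) . \]

Successive differentiation then indicates that

\[ C_1(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{itx}\, ix}{1 + x^4}\,dx = -\tfrac12\, e^{-|t| 2^{-1/2}} \sin(t\,2^{-1/2}) , \]

\[ C_2(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{itx}\, (ix)^2}{1 + x^4}\,dx = \frac{1}{2\sqrt2}\, e^{-|t| 2^{-1/2}} \bigl( \sin(|t|\,2^{-1/2}) - \cos(t\,2^{-1/2}) \bigr) , \]

\[ C_3(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{itx}\, (ix)^3}{1 + x^4}\,dx = \tfrac12\, \operatorname{sgn} t\; e^{-|t| 2^{-1/2}} \cos(t\,2^{-1/2}) . \]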
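The closed form for $C_0(t)$ can be checked against a direct numerical evaluation of $(2\pi)^{-1} \int e^{itx} (1 + x^4)^{-1}\,dx$, and $C_1$ against a finite-difference derivative of $C_0$. This is a sketch only; the cutoff and grid size are arbitrary numerical choices.

```python
import numpy as np

def c0_closed_form(t):
    """(1 / (2 sqrt 2)) exp(-|t|/sqrt 2) (cos(|t|/sqrt 2) + sin(|t|/sqrt 2))."""
    u = abs(t) / np.sqrt(2.0)
    return float(np.exp(-u) * (np.cos(u) + np.sin(u)) / (2.0 * np.sqrt(2.0)))

def c1_closed_form(t):
    """-(1/2) exp(-|t|/sqrt 2) sin(t/sqrt 2)."""
    return float(-0.5 * np.exp(-abs(t) / np.sqrt(2.0)) * np.sin(t / np.sqrt(2.0)))

def c0_numeric(t, cutoff=400.0, points=800001):
    """Trapezoidal approximation; the sine part vanishes by symmetry."""
    x = np.linspace(-cutoff, cutoff, points)
    y = np.cos(t * x) / (1.0 + x ** 4)
    area = float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)
    return area / (2.0 * np.pi)

def c0_derivative(t, step=1e-5):
    """Central difference of the closed form, to compare with C_1."""
    return (c0_closed_form(t + step) - c0_closed_form(t - step)) / (2.0 * step)
```

The exponential factor shows why boundary effects are confined to a layer of width of order $\lambda^{1/4}$ once $t$ is rescaled by $\lambda^{-1/4}$.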


An application of the Poisson summation formula tells us that


±i 


V‘>  . X 


Ec3 

u J 


((k-t)X_i)  . 


(3 


Only the terms in the sum (3.48) corresponding to $k = 0$, $k = 1$ need to
be considered since the sum of the remaining terms dies off at the rate
$e^{-a \lambda^{-1/4}}$ with $a$ a positive constant. Notice that the formulas for the
$C_j(t)$ above imply that


C,  - C,  - — 

1 3 2/2 


4 . 2 

Lemma  5,  Assume  that  f € C . Then  if  Af  ^ 0 


E Ag°-Af°  “ - X1/2Af2 


(3 


while  if  Af  =0 


.48) 


.49) 
E Ag°-Af°  SJ


2^2  X3/4  (f(3)(0) 


+ f(3)(l)}. 


If  f(2)(0)  + f(2)(l)  t 0 we  have 


E Ag3-Af 3 iJl  X1/4  | (f(2)(0)  + f C2> (1) } 


and  if  f(2)(0)  + f(2)(l)  - 0 


Ag3-Af 3 ^ 


Af3X1/2 


as  X - X(n)  0. 


The  asymptotic  relations  (3.49)  and  (3.50)  foil >w  from  (3.40), 
and  (3.45).  Formula  (3.51)  is  a consequence  of  (3.41)  and  (3.42). 
f(2)(0)  + f(2)(l)  = 0,  since  l Re  = - (f(2)(0)  + f(2)(l))  one 
see  that 


E 


l+X^nk)14 


X(2nk)4  Re 
1+X(2nk)4 


However  by  (3.45) 

Af3  1 / 1 ,r(4) , , . f(4)  „ xx  , . , 

Re  ci  = ^ [ - r-  (f  (x)  + f (-x))cos  2riicx  dx 

* ( 2 Ttk)  *■  (2TTk)  ^ J Z 

This  implies  (3.52), 

Lemma 6. Let $f \in C^4$. If $f^{(2)}(0) \ne 0$, $f^{(2)}(1) = 0$ then


I LI 

(3.50) 


(3.31) 


(3.52) 

(3.43) 

If 

can 


(3.53) 


(3.54) 
\[ E\,g(t) - f(t) = f^{(2)}(0)\, \lambda^{1/2}\, e^{-t 2^{-1/2} \lambda^{-1/4}} \bigl[ \sin(t\,2^{-1/2} \lambda^{-1/4}) - \cos(t\,2^{-1/2} \lambda^{-1/4}) \bigr] + e(t) , \tag{3.55} \]

$0 < t < 1$, where the error term $e(t)$ is such that

\[ \int_0^1 e(t)^2\,dt = o\Bigl( \int_0^1 [E\,g(t) - f(t)]^2\,dt \Bigr) . \tag{3.56} \]

If $f^{(2)}(0) = f^{(2)}(1) = 0$, $f^{(3)}(0) \ne 0$, $f^{(3)}(1) = 0$, we have

\[ E\,g(t) - f(t) = -2^{1/2} f^{(3)}(0)\, \lambda^{3/4}\, e^{-t 2^{-1/2} \lambda^{-1/4}} \cos(t\,2^{-1/2} \lambda^{-1/4}) + e(t) , \tag{3.57} \]

$0 < t < 1$, where the error term again satisfies (3.56). The approximations
appropriate for the cases $f^{(2)}(0) = 0$, $f^{(2)}(1) \ne 0$ and $f^{(2)}(0) =
f^{(2)}(1) = 0$, $f^{(3)}(0) = 0$, $f^{(3)}(1) \ne 0$ are obtained by replacing $t$ by
$1 - t$ in the main expressions on the right of (3.55) and (3.57) respectively.

We next consider the variance and bias of the derivative $g'$ of the
smoothing spline. Theorems 3 and 4 follow from the previous analysis of $g$,
after noting that the Fourier coefficients of $g'$ are

\[ g'_0 = \Delta g^0, \qquad g'_k = \Delta g^1 a_k - a_k h_k, \quad k \ne 0 . \]

We first consider the integrated squared variance

From (3.26),

\[ \sigma^2(g'_0) \cong \frac{\sigma^2}{n}\, C_2 C_1^{-2} \lambda^{-1/4} . \]


As  in  (3.16) 


i 


3k+snAg  ak+sn\+sn  \+snAg  (* 


X+r, 


h h (nA 
k+-sn  k ) 


, (n).  0 ak+sn,k+sn  yk 

+ ak+snak  Ag  ‘ 


X+r, 


(3.60) 


Estimates  similar  to  those  used  in  the  analysis  of  the  smoothing  spline 

show  that  the  contribution  to  the  variance  from  the  first  term  is  of  order 

X ^n  The  second  term  gives  a contribution  of  order  X ^n  the  third 
term  dominates,  giving  a total  contribution  to  V 


Z V l2*ikl2 

n 4 2 

[X(2Ttik)>l] 


9- 


2 

q 

n 


(2rx)2 

((2tix)4+1)" 


dx 


(3.61) 


Next,  the  bias: 


E g 


k+sn 


k+sn 


*k+s„(E  ^ - 4£l> 

ak+sn^  Nc+sn  "hc+sn^ 


(3.62) 


which,  as  in  (3.44) 
b a a(n)

(E  Ag°-Af0)  

k 


(E  Ag1-Af 1)  j^c 

[- 


. Cn)  (n). 

k ®k  k+sn 


k+sn 


k+sn 


X+r, 


i 


k | X+rk  ~ °k+sn 


]■ 


(3.63) 


Making  approximations  as  in  the  analysis  of  the  spline  function  itself. 


E g'(t)-f'(t)  3?  (E  Ag°-Af°)  + (E  Ag°-Af°)  X"1  BQ(t) 

+ (E  Ag^Af1)  B3(L)  + A£2B2(t)  - Af3B1(t)  . 


Using the Poisson-summation approximation and Lemma 5, if $f^{(2)}(0) \ne 0$,
$f^{(2)}(1) = 0$,

\[ E\,g'(t) - f'(t) \cong f^{(2)}(0)\, 2^{1/2} \lambda^{1/4}\, e^{-u} \cos u \]

where $u = 2^{-1/2} \lambda^{-1/4} t$. If $f^{(2)}(0) = f^{(2)}(1) = 0$, and $f^{(3)}(0) \ne 0$, $f^{(3)}(1) = 0$,

\[ E\,g'(t) - f'(t) \cong f^{(3)}(0)\, \lambda^{1/2}\, e^{-u} (\sin u + \cos u) . \]

Note that the approximate (in an $L^2$ sense) bias of the derivative is
the derivative of the approximate bias (Lemma 6).

We now sketch the development of the deconvolution results. Since
this parallels closely the derivations of Section 3 the presentation will
be somewhat sketchier. As before let $g$ have Fourier coefficients

\[ g_k = \Delta g^0 a_k - \Delta g^1 b_k + h_k b_k, \qquad k \ne 0 . \tag{4.1} \]


and  let 


*k  ” Wk  \ 

Bk  = wk  bk 
\ " wk  bk  \ 


and  define  as  in  Section  3.  Then  (1.7)  may  be  written  as 


2 


k-1 


h 


,(n) 


2 


+ X 


n-1 


(hg1)2  + 


J-l 


(4.2) 


Minimizing the 0th term gives $h_{sn} = 0$, $s \ne 0$, and $G_0 + G_0^{(n)} = y_0 / \sqrt n$.
As in the analysis of Section 3, we first fix $\Delta g^0$ and $\Delta g^1$ and
minimize with respect to the $h_k$'s. Let

\[ z_j = \frac{y_j}{\sqrt n} - \Delta g^0 A_j^{(n)} + \Delta g^1 B_j^{(n)} . \]


Then  (4.2)  becomes 


27  iv«J<n)i2  + l[W8l>2+?'?  'v 


sn 1 * Bk+sn 


The  minimizing  coefficients  can  be  calculated  to  be 


(4.3) 


h « B — 
j+on  J+sn  X+Pj 


(4.4) 


CO  0 1 

where  p - l |b  | . Now  to  calculate  the  minimizing  Ag  and  Ag  , 

^ gs- 00 

this  solution  is  substituted  back  into  (4.3)  to  give 


» 27  i;/  o^r1  w>2 . «-s> 

Minimizing  this  with  respect  to  Ag°  and  dg  amounts  to  solving  two 
linear  equations  in  two  unknowns,  and  it  may  be  seen  that  the  solution 
is  approximately 

Ag^  3*  | Re  Ajn^  <X+Pj) 

Ag1  «-  j Af1  + Re£'  | »<“>  |l  + E l^l2  ‘ 

(4.7) 


Y' 


, (n) 1 2 


(X+Pj) 


-li 


1-1 


(4.6) 
We next consider the integrated squared variance, which is the sum of
the variance of the Fourier coefficients of $g$. Now, from above,


g.,  Ag  a.j 
J+sn  j+sn 


0 |"a  _ iVsJ  WH-snAjn  1 

j+sn  X+Pj  J 

+ **  L . 

8 j+sn  X+Pj  J 


r 2 * 


X+Pj  ^ * 


(4.8) 


Via approximations similar to those in Section 3, it may be seen that the
first two terms contribute a net variance of order $n^{-1} \lambda^{-2\beta/(2\beta+4)}$ whereas
the third term contributes the dominating variance, which is of order
$n^{-1} \lambda^{-(2\beta+1)/(2\beta+4)}$.

If we write the Fourier coefficients of $f$ as

\[ f_k = \Delta f^0 a_k - \Delta f^1 b_k + m_k b_k \tag{4.9} \]

and take expectations in (4.8), the bias of the $(j+sn)$th Fourier coefficient
may be expressed as

E 8k+sn"fk+sn  “ (E  Ag 


- (E  Ag 


°-Af°)  |aR. 
1-4fl>  [' 


lbk+sJ  wk+snAk 
X+PL 


lbk+sJ  wk+snBk 


] 


2-  w(n) 


•VsJ  wk+snMk’ 


X+p,  *\+sn  k+sn 


(4.10) 
As in section 3 for $|k| \le n/2$, $k \ne 0$


E gk-ffc  « (E  &g°-Af°)  X*k 


*+K| 


1 1 Xbk 

- (E  Ag-Af  ) 


X+|B. 


+ Af 


2 Xakbk 


- Af 


H\r 

3 K_ 

*+K  V 


Xf<*>b l 
k k 


x+K  I 


2 * 


(4.11) 


If  we  let 


D . (t)  - X 


** 


j k X+|bJ‘ 


exp  (-2nikt)  j»0,l,2,3 


(4.12) 


(note  that  then 


E g(t)-f(t)  ^ (E  Ag°-Af°)D3(t)  - (E  Ag1-Af1)D2(t) 


+ Af2D1(t)  - Af3DQ(t) 


f<4)  bk 
k k 


+ XT  k exp  (-2nikt)  . 

^ »+KI2 


(4.13)

The functions $D_j(t)$ play the role of the functions $B_j(t)$ of Section 3. Although their exact analytical forms depend on w, they are, like the


and  1 as  X 0. 


We  now  consider  the  individual  terms  in  (4.13).  From  (4.6)  it 
follows  that 


-1 


_ . 0 ..0  E mkBk\(x+pk)' 

E A g -Af  = -5 - 7~  . 

I l\l 


The  denominator  can  be  estimated  to  be  ~ ^3/(26+4)^  ^ jl  o,  the 

numerator  is 


aAf2EiB/u+Pkr1~*-1/<2B+‘>  • 


In  combination  with  D3  this  gives  a net  contribution  to  the  integrated 
squared  bias  which  is  ~ X ^ B \ If  Af  “0  the  numerator  is 
3?  (f^(l)  + f^(0))/2,  giving  a net  contribution  of  order  X7^26"H,\ 
If  f^(0)  » f^(l)  =0  k » 2,3  the  net  contribution  is  0(X2). 

Next , 


E Ag1-Af1  = 


Af1  + )'  ]2(X+pk) 

1 + T iBj  |2(X+pk)_1 


The  denominator  is  ~ x'1/(26+4)  and  if  f(2)(l)  or  f(2>(0W0  the 
numerator  is  S (f(2)(l)  + f(2)(0))/?.  This  gives  a net  contribution  to 


1 *1. 
Abstract

One criterion proposed in the literature for selecting the smoothing parameter(s) in Rosenblatt-Parzen nonparametric constant kernel estimators of a probability density function is a leave-out-one-at-a-time nonparametric maximum likelihood method. In empirical work with this estimator in the univariate case, we found that it worked quite well for short-tailed distributions. It produced estimators which differed little from those produced by an intuitively appealing maximum likelihood method, depending on a random split of the data, which we had proposed earlier in unpublished work. However, both of these methods drastically oversmoothed for long-tailed distributions. In fact, we have shown that these nonparametric maximum likelihood methods will not select uniformly consistent estimates of the density for long-tailed distributions such as the double exponential or the Cauchy distribution when the kernel has compact support. A remedy we found for estimating long-tailed distributions was to apply the nonparametric maximum likelihood procedures to a variable kernel class of estimators considered by Breiman et al. (Technometrics, 19, No. 2, May 1977, 135-141).

In addition to constant and variable kernel estimators we investigated the maximum likelihood criterion applied to a histogram family of estimators and report our experience with some modifications of the above procedures.

Our experience with these estimators includes numerous univariate case studies. This paper reports on the methods as applied to two univariate data sets of one hundred samples (one Cauchy, one normal). Finally, we discuss our limited experience in the multivariate case.


During the past decade there has been much work in the area of non-parametric density estimation. Unfortunately, most of the results have been of the large sample type and little guidance exists as to the practical implementation of the estimators proposed. This is the primary reason why these estimation procedures are used extensively by only a few applied statisticians. One criterion for selecting one out of a family of non-parametric density estimators which has been mentioned in the literature is the maximum likelihood (ML) criterion. Habbema et al. (1974) and Duin (1976) mention the same ML procedure in the context of Rosenblatt-Parzen kernel type estimation. The form of these estimators is

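$$\hat f(x) = \hat f(x \mid \theta; x_1,\ldots,x_n) = \frac{1}{n}\sum_{i=1}^{n} K_\theta(x - x_i) \tag{1}$$

where $x_1,\ldots,x_n$ is a random sample from the density f(x) and $K_\theta$ is a density with smoothing parameter $\theta$ (the quantities x, $x_i$ and $\theta$ may be multidimensional). In the univariate case

$$K_\theta(x) = \frac{1}{\theta}\,K\!\left(\frac{x}{\theta}\right), \qquad \theta > 0 \tag{2}$$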

where $K(\cdot)$ is a fixed density. Choosing $\theta$ to maximize the non-parametric likelihood

$$\prod_{i=1}^{n} \hat f(x_i \mid \theta; x_1,\ldots,x_n) \tag{3}$$

is useless; (3) is unbounded as $\theta \to 0$. To avoid this degeneracy problem,

Habbema et al. and Duin consider replacing $\hat f(x_i \mid \theta; x_1,\ldots,x_n)$ in (3) by the kernel estimator of $f(x_i)$ based on the data with $x_i$ removed. That is, they chose $\theta$ to maximize the criterion

$$\prod_{i=1}^{n} \hat f(x_i \mid \theta; x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_n) \tag{4}$$

This method may be applied to any family of estimators based on some smoothing parameter $\theta$. We designate by ML1 this leave-out-one-at-a-time maximum likelihood procedure as a general method of choosing $\theta$.
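The contrast between the degenerate likelihood (3) and the leave-one-out criterion (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the Gaussian kernel, the pseudo-random normal sample, the grid of candidate values, and the function names `full_log_likelihood`, `loo_log_likelihood` and `ml1_select` are all choices made here for the example.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used here as the fixed kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def full_log_likelihood(data, theta, kernel=gaussian_kernel):
    """Log of (3): every x_i is scored by an estimate that includes x_i itself."""
    n = len(data)
    total = 0.0
    for xi in data:
        dens = sum(kernel((xi - xj) / theta) for xj in data) / (n * theta)
        total += math.log(dens)
    return total


def loo_log_likelihood(data, theta, kernel=gaussian_kernel):
    """Log of (4): x_i is scored by the estimate built from the other n-1 points."""
    n = len(data)
    total = 0.0
    for i, xi in enumerate(data):
        s = sum(kernel((xi - xj) / theta) for j, xj in enumerate(data) if j != i)
        if s <= 0.0:
            return float("-inf")  # x_i receives no mass from any other point
        total += math.log(s / ((n - 1) * theta))
    return total


def ml1_select(data, grid, kernel=gaussian_kernel):
    """Return the grid value of theta that maximizes the leave-one-out criterion."""
    return max(grid, key=lambda t: loo_log_likelihood(data, t, kernel))


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
grid = [0.1 * j for j in range(1, 21)]  # illustrative grid 0.1, 0.2, ..., 2.0
theta_hat = ml1_select(sample, grid)
print("ML1 choice of theta:", theta_hat)
print("log of (3) at theta = 0.1 and 0.001:",
      full_log_likelihood(sample, 0.1), full_log_likelihood(sample, 0.001))
```

Shrinking theta keeps increasing the full-sample value because each point's own kernel contributes K(0)/(n theta) to its own score, while the leave-one-out value is maximized at an interior grid point.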

Wahba (1978) refers to (4) as a "cross-validation likelihood function." We think that term better describes the new maximum likelihood method which we now propose. Suppose we have a family of estimators $\{\hat f(x \mid \theta; x_1,\ldots,x_n)\}$ for certain values of a smoothing parameter $\theta$. Let $\mathbf{x}_1$ be a subset of $\{x_1,\ldots,x_n\}$ and $\mathbf{x}_2 = \{x_1,\ldots,x_n\} - \mathbf{x}_1$. Denote by $\hat f(x \mid \theta_i; \mathbf{x}_i)$ the estimator determined by the data values in $\mathbf{x}_i$ using the smoothing parameter $\theta_i$. We use the data in $\mathbf{x}_1$ ($\mathbf{x}_2$) to choose $\theta_2$ ($\theta_1$) as follows: $\theta_i$ is chosen to maximize

$$\prod_{x \in \mathbf{x}_j} \hat f(x \mid \theta_i; \mathbf{x}_i), \qquad i, j \in \{1,2\},\ j \ne i. \tag{5}$$

The natural density estimator based on the data split $(\mathbf{x}_1, \mathbf{x}_2)$ is

$$\hat f(x) = \frac{n_1 \hat f(x \mid \theta_1; \mathbf{x}_1) + n_2 \hat f(x \mid \theta_2; \mathbf{x}_2)}{n_1 + n_2} \tag{6}$$

where $n_i$ is the number of elements in $\mathbf{x}_i$. In Section 3 we consider this estimator for "equal" splits, $n_1 = [n/2]$. A permutation invariant estimator of f(x) is the average of estimators (6) over all equal splits of the data. This estimator is computationally not feasible for moderate n. We suggest averaging (6) over several random splits of the data. In our experience there has been little change in the estimator after averaging over only a small number of random splits (one, two or three). We designate by ML2 this split-sample procedure as a general method for choosing $\theta$. In Section 3 a single likelihood value is utilized as a measure of overall performance; for the ML1 method the single

$$\prod_{i=1}^{n} \hat f(x_i) \tag{7}$$

where $\hat f$ is given by (6).
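The split-sample procedure ML2 can be sketched as follows for the constant kernel family. The Gaussian kernel, the single random equal split, the grid, and the names `split_log_likelihood` and `ml2_estimator` are assumptions made for this illustration; the paper itself fixes only the criteria (5), (6) and (7).

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_estimate(x, data, theta):
    """Constant kernel estimate (1)-(2) built from `data`."""
    return sum(gaussian_kernel((x - xi) / theta) for xi in data) / (len(data) * theta)


def split_log_likelihood(held_out, data, theta):
    """Log of (5): score the held-out half under the estimate built from `data`."""
    total = 0.0
    for x in held_out:
        dens = kernel_estimate(x, data, theta)
        if dens <= 0.0:
            return float("-inf")
        total += math.log(dens)
    return total


def ml2_estimator(sample, grid, rng):
    """One random equal split; returns the mixture (6) and the two chosen thetas."""
    shuffled = list(sample)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    x1, x2 = shuffled[:half], shuffled[half:]
    # theta_1 belongs to the estimator built on x1 and is tuned on x2, and vice versa.
    theta1 = max(grid, key=lambda t: split_log_likelihood(x2, x1, t))
    theta2 = max(grid, key=lambda t: split_log_likelihood(x1, x2, t))
    n1, n2 = len(x1), len(x2)

    def f_hat(x):
        return (n1 * kernel_estimate(x, x1, theta1)
                + n2 * kernel_estimate(x, x2, theta2)) / (n1 + n2)

    return f_hat, theta1, theta2


rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
grid = [0.1 * j for j in range(1, 21)]
f_hat, theta1, theta2 = ml2_estimator(sample, grid, rng)
overall = sum(math.log(f_hat(x)) for x in sample)  # log of the overall criterion (7)
print(theta1, theta2, overall)
```

Averaging over several random splits, as the text suggests, would amount to calling `ml2_estimator` repeatedly and averaging the returned functions pointwise.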

In empirical work with these estimators we found that they both worked quite well and were in close agreement for short tailed distributions. However, both methods drastically oversmoothed for long tailed distributions. In section 2 we discuss the nonconsistency of these methods when using kernels with compact support to estimate densities with tails as long as the double exponential or Cauchy distributions. A remedy which we found for estimating these long tailed distributions was to apply the non-parametric maximum likelihood procedures to a variable kernel class of estimators considered by Breiman et al. (1977). This remedy is discussed in section 3 where we analyze two univariate data sets, one from the standard normal and one from the Cauchy.

In section 4 we briefly discuss some of our experience in the multivariate case for the ML2 method. Finally, in section 5 we give some comments and conclusions.


2. Nonconsistency of the ML Procedure.

By nonconsistency we mean that $\sup_x |\hat f_n(x) - f(x)| \not\to 0$ in probability. This nonconsistency will be demonstrated for a wide class of densities f and kernels k, but we make no attempt to state results for as wide a class as possible. For the sake of argument, attention is placed on the left tail of the distribution and we consider only the ML1 estimators.

Let F denote the cdf of the density f and let $h(u) = u / f(F^{-1}(u))$, $0 < u < 1$. We assume

$$h \text{ is continuous and } \lim_{u \to 0+} h(u) = h(0+) \text{ exists, possibly infinite.} \tag{8}$$

We say f has a long left tail if $h(0+) > 0$. Assume for the present only that k has finite support. Without loss of generality we suppose the support is contained in [-1,1], i.e.

$$k(u) = 0 \quad \text{if } |u| > 1. \tag{9}$$


A basic observation concerning the smoothing parameter $\theta_n > 0$ which maximizes (4) is that for each $x_i$, $|x_i - x_j| \le \theta_n$ for some $x_j$ with $j \ne i$. In particular for the ML1 estimator

$$x_{2n} - x_{1n} \le \theta_n, \tag{10}$$

where $x_{1n} < x_{2n} < \cdots < x_{nn}$ are the order statistics of the sample.

Let $u_{1n} = F(x_{1n})$, $u_{2n} = F(x_{2n})$. Then $x_{2n} - x_{1n} = F^{-1}(u_{2n}) - F^{-1}(u_{1n}) = h(u_n)(u_{2n} - u_{1n})/u_n$, where $u_{1n} \le u_n \le u_{2n}$. From (10) it follows

$$h(u_n)(u_{2n} - u_{1n})/u_{2n} \le \theta_n. \tag{11}$$

Using uniformity of $(u_{2n} - u_{1n})/u_{2n}$ and standard arguments, (11) gives

$$P(\theta_n < b\varepsilon) < \varepsilon + P(h(u_n) < b), \qquad b, \varepsilon > 0. \tag{12}$$
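The passage from the order-statistic bound to the probability bound can be spelled out as follows. This is a sketch of the standard argument the text alludes to; the symbol $V_n$ is introduced here only for convenience, and the last line gives the bound with a weak inequality.

```latex
\begin{align*}
\frac{d}{du}F^{-1}(u) &= \frac{1}{f(F^{-1}(u))} = \frac{h(u)}{u},\\
x_{2n}-x_{1n} &= \frac{h(u_n)}{u_n}\,(u_{2n}-u_{1n}) \;\ge\; h(u_n)\,V_n,
\qquad V_n = \frac{u_{2n}-u_{1n}}{u_{2n}},\\
\{\theta_n < b\varepsilon\} &\subseteq \{h(u_n)\,V_n < b\varepsilon\}
\subseteq \{V_n < \varepsilon\}\cup\{h(u_n) < b\},\\
P(\theta_n < b\varepsilon) &\le P(V_n<\varepsilon) + P(h(u_n)<b)
= \varepsilon + P(h(u_n)<b).
\end{align*}
```

The first line is the derivative of the quantile function, the second uses the mean value theorem together with $u_n \le u_{2n}$ and $h \ge 0$, and the last uses the fact that $u_{1n}/u_{2n}$, and hence $V_n$, is uniformly distributed on (0,1) for the two smallest order statistics of a uniform sample.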


Lemma 1. Under (8) and (9), $h(0+) > 0$ implies $\theta_n \not\to 0$ in probability. Furthermore $h(0+) = \infty$ implies $\theta_n \to \infty$ in probability.

Proof: Choose $0 < b < h(0+)$ in (12).

Lemma 2. If $\theta_n \to \infty$ in probability and $\sup|k(u)| < \infty$ then $\sup_x \hat f_n(x) \to 0$ in probability.

Since k is bounded the proof follows from (1).

Now Lemmas 1 and 2 combine to give the nonconsistency result for distributions like the Cauchy where $h(0+) = \infty$. There is no difficulty here; the density estimate flattens out entirely. It is more difficult to establish the nonconsistency for boundary cases where $0 < h(0+) < \infty$. The double exponential density is one of these and is covered by the following lemma. In addition to (9) we will assume that the kernel k is left continuous and of bounded variation on $(-\infty, \infty)$.

Lemma 3. Let $\theta_n$ maximize $L_n(\theta)$ of (4) for each n. Suppose f is unimodal and $h(0+) = a$ where $0 < a < \infty$. Then $\sup_x |\hat f_n(x) - f(x)| \not\to 0$ in probability.

The proof can be found in Schuster and Gregory (1981).

The following table gives the left tail behavior of some common distributions.

Distribution          Density                                                              $\lim_{u \to 0} u / f(F^{-1}(u))$
Normal                $(\sigma\sqrt{2\pi})^{-1}\exp\{-\frac{1}{2}((x-\mu)/\sigma)^2\}$     0
Double exponential    $(\lambda/2)\exp\{-\lambda|x-\mu|\}$                                 $1/\lambda$
Cauchy                $(\sigma/\pi)\{\sigma^2 + (x-\mu)^2\}^{-1}$                          $\infty$
Finite support                                                                             0
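The limits in the table can be checked numerically by evaluating $h(u) = u / f(F^{-1}(u))$ at small u. The sketch below uses the standard forms of each distribution (location 0, scale 1 unless a rate `lam` is passed); the function names are illustrative and the closed-form quantile functions are standard facts, not taken from the paper.

```python
import math
from statistics import NormalDist


def h_normal(u):
    """h(u) for the standard normal; tends to 0 as u -> 0."""
    nd = NormalDist()
    return u / nd.pdf(nd.inv_cdf(u))


def h_double_exponential(u, lam=1.0):
    """h(u) for density (lam/2) exp(-lam |x|), valid for 0 < u < 1/2."""
    x = math.log(2.0 * u) / lam          # left-tail quantile F^{-1}(u)
    return u / (0.5 * lam * math.exp(-lam * abs(x)))


def h_cauchy(u):
    """h(u) for the standard Cauchy; grows without bound as u -> 0."""
    x = math.tan(math.pi * (u - 0.5))    # quantile F^{-1}(u)
    return u * math.pi * (1.0 + x * x)


for u in (1e-2, 1e-4, 1e-6):
    print(u, h_normal(u), h_double_exponential(u, lam=2.0), h_cauchy(u))
```

For the double exponential the ratio is constant at 1/lam over the whole left tail, which is why it sits on the boundary between the short-tailed and long-tailed cases.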

3. Two Univariate Data Sets Analyzed.

Two pseudo-randomly generated data sets, one from a Cauchy distribution and another from a normal distribution, are investigated in this section. General implications are summarized in Section 5.

We first consider the Cauchy example. Table 1 shows n=100 order statistics of a pseudo-random sample from a standard Cauchy distribution, $f(x) = \pi^{-1}(1+x^2)^{-1}$. The asterisks (*) indicate a division into two subsamples to be discussed later.

We consider two types of kernel estimators, the constant kernel type given by equations (1) and (2) where we write $\theta = (\sigma)$, and the variable kernel type

$$\hat f(x) = \hat f(x \mid \theta; x_1,\ldots,x_n) = \frac{1}{n}\sum_{i=1}^{n} (\alpha d_{ik})^{-1} K\!\left(\frac{x - x_i}{\alpha d_{ik}}\right) \tag{13}$$

where $\theta = (k, \alpha)$, $k \in \{1,\ldots,n\}$, $\alpha > 0$, and $d_{ik}$ is the distance from $x_i$ to its kth nearest neighbor in the sample $\{x_1,\ldots,x_n\}$. For the analysis we chose a kernel K similar to the standard normal but one involving less computing time;

K is the t(29) density.

In our experience the choice of the kernel among those with infinite support seemed to matter little. However, for long-tailed distributions (such as our present example) kernels with finite support perform poorly.
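A minimal sketch of the variable kernel estimator (13) with the t(29) kernel is given below. It assumes that the kth nearest neighbor of $x_i$ is sought among the other n-1 sample points; the pseudo-random Cauchy sample and the names `t_density`, `knn_distances` and `variable_kernel_estimate` are illustrative and are not the authors' code or data.

```python
import math
import random


def t_density(u, nu=29):
    """Student t density with nu degrees of freedom (nu = 29 as in the text)."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1.0 + u * u / nu) ** (-(nu + 1) / 2)


def knn_distances(data, k):
    """d_ik: distance from x_i to its k-th nearest neighbour among the other points."""
    out = []
    for i, xi in enumerate(data):
        dists = sorted(abs(xi - xj) for j, xj in enumerate(data) if j != i)
        out.append(dists[k - 1])
    return out


def variable_kernel_estimate(x, data, d_k, alpha, kernel=t_density):
    """Estimator (13): each point carries its own bandwidth alpha * d_ik."""
    n = len(data)
    return sum(kernel((x - xi) / (alpha * di)) / (alpha * di)
               for xi, di in zip(data, d_k)) / n


rng = random.Random(2)
# Standard Cauchy variates by inverting the cdf (illustrative data only).
cauchy_sample = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(100)]
d15 = knn_distances(cauchy_sample, 15)
d30 = knn_distances(cauchy_sample, 30)
print(variable_kernel_estimate(0.0, cauchy_sample, d30, alpha=2.6))
```

Points in the sparse tails have large $d_{ik}$ and therefore wide kernels, while points near the center keep narrow ones, which is what lets this family avoid the oversmoothing seen with a single constant bandwidth.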

Consider first the method ML1 for these two types of kernel estimators. We consider the types together as one family and let the maximum of the likelihood (4) choose between them. Notice that for the variable kernel estimators the maximum of (4) is sought over a two-dimensional space $\{(k,\alpha)\}$. The range of $\alpha$ (as well as $\sigma$) used for our likelihood calculations is .1(.1)5.5 (i.e., from .1 to 5.5 in steps of .1) and that for k is 15(5)45. The constant kernel estimate picked by the ML1 method is useless, being much too flat (oversmoothed). In fact the estimate has a maximum of only .15 and possesses extremely long tails. In the combined family the ML1 method picked the variable kernel estimator with k = 30 and $\alpha$ = 2.6. Figure 1 shows this estimator, as well as others, superimposed over a graph of the theoretical Cauchy density. Breiman et al. (1977), page 136, consider three error measures, percent variance not explained (PVNE), mean absolute error (MAE), and mean percent error (MPE), for comparing an estimated density $\hat f$ to a theoretical density f (in this case the Cauchy density which was the model for the pseudo-random samples). The error measures are defined as

$$\mathrm{PVNE} = \frac{1}{n\sigma_f^2}\sum_{i=1}^{n}\bigl(\hat f(x_i) - f(x_i)\bigr)^2 \times 100,$$

$$\mathrm{MAE} = \frac{1}{n\mu_f}\sum_{i=1}^{n}\bigl|\hat f(x_i) - f(x_i)\bigr| \times 100, \quad \text{and}$$

$$\mathrm{MPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\bigl|\hat f(x_i) - f(x_i)\bigr|}{f(x_i)} \times 100, \tag{14}$$

where $\mu_f = \frac{1}{n}\sum_{i=1}^{n} f(x_i)$ and $\sigma_f^2 = \frac{1}{n}\sum_{i=1}^{n}\bigl(f(x_i) - \mu_f\bigr)^2$.

In Figure 2 these are plotted as a function of $\alpha$ for the case k = 30, chosen by the ML1 method. Superimposed on the graph is a plot of transformed likelihood values (4), plotted against $\alpha$ for the case k = 30. The particular transformation plotted, (-log{expression (4)} - 240)/2, is of no significance; the only intent was to bring the values into the range of the error percentages. It is seen that the ML1 method chooses a value of $\alpha$ (2.6) which is near the minimizing value for each error measure. Note that since all error measures in (14) depend on the unknown density, they could not be used in selecting the smoothing parameters.
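The three error measures can be computed directly from paired values of the estimated and theoretical densities at the sample points. The sketch below assumes the divisor n in both $\mu_f$ and $\sigma_f^2$; the function name `error_measures` and the toy inputs are illustrative only.

```python
import math


def error_measures(f_hat_vals, f_vals):
    """Return (PVNE, MAE, MPE) in percent for estimated vs. true density values."""
    n = len(f_vals)
    mu_f = sum(f_vals) / n
    var_f = sum((f - mu_f) ** 2 for f in f_vals) / n
    pairs = list(zip(f_hat_vals, f_vals))
    pvne = 100.0 * sum((fh - f) ** 2 for fh, f in pairs) / (n * var_f)
    mae = 100.0 * sum(abs(fh - f) for fh, f in pairs) / (n * mu_f)
    mpe = 100.0 * sum(abs(fh - f) / f for fh, f in pairs) / n
    return pvne, mae, mpe


def cauchy_density(x):
    """Standard Cauchy density, the theoretical f of the first example."""
    return 1.0 / (math.pi * (1.0 + x * x))


points = [-3.0 + 0.25 * i for i in range(25)]       # toy evaluation points
truth = [cauchy_density(x) for x in points]
inflated = [1.1 * f for f in truth]                  # estimate that is 10% too high
flat = [sum(truth) / len(truth)] * len(truth)        # estimate that ignores all shape

print(error_measures(inflated, truth))
print(error_measures(flat, truth))
```

A uniformly inflated estimate gives MAE and MPE equal to the inflation percentage, while a constant estimate equal to the mean of f leaves all of the variance unexplained, so its PVNE is 100.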

Breiman et al. (1977), page 140, used a goodness-of-fit criterion to choose the smoothing parameters of kernel estimators in fitting two bivariate data sets. We investigated this goodness-of-fit criterion for the present univariate data set. The one dimensional value of V(r) in the Breiman paper is 2r. With V(r) = 2r the goodness-of-fit criterion did not work; over a range of values k, the likelihood values were increasing at $\alpha$ = .1 as $\alpha$ decreased. For $\alpha$ this small, the estimators were already too rough. It seemed to us that perhaps in one dimension one should use V(r) = r. However, this change gave no better results.

Consider now the method ML2 applied to the constant and variable kernel estimators. The application of the method to the constant kernel estimators is straightforward. However, for the variable kernel case, the strict application of the method might lead to an estimator which is a mixture of two with different values of k, which we view as undesirable. We make a distinction between the parameters k and $\alpha$ similar to that made by Wahba (1978) in another setting: $\alpha$ is the primary smoothing parameter and k is a secondary shape parameter. The value of k may be chosen first as follows. (i) Choose at random a partition $(\mathbf{x}_1, \mathbf{x}_2)$. (ii) For each of several values of k calculate a value for the overall criterion (7) where the ML2 method has been applied to the smoothing parameter $\alpha$ only. (iii) Choose the value k which maximizes (7). We repeated the above procedure over three random partitions for the values k = 9(3)21. In each case the value k = 15 was selected. Then with k selected at 15 we chose $\alpha$ based on a new random split. The asterisks in Table 1 indicate the resulting split. Say that a value is in $\mathbf{x}_1$ if it has an asterisk and in $\mathbf{x}_2$ otherwise. Searching over $\alpha$ = .1(.1)5.5 the ML2 method chose $\alpha_1$ = 2.7 and $\alpha_2$ = 2.5. The resulting estimator is shown in Figure 1 and is very close to the estimator chosen by the ML1 method. Notice that the value of k chosen by the ML1 method is 30% of the total sample size and that the k chosen by the ML2 method is 30% of the size of each split sample.

We also considered a histogram estimator from Van Ryzin (1973),

$$\hat f(x \mid \theta; x_1,\ldots,x_n) = \begin{cases} \dfrac{\theta}{n\,(x_{(j+\theta)} - x_{(j)})} & \text{if } x_{(j)} \le x < x_{(j+\theta)}, \quad j = 1, \theta+1, 2\theta+1, \ldots, r-\theta, \\[2ex] 0 & \text{if } x < x_{(1)} \text{ or } x \ge x_{(r)}, \end{cases} \tag{15}$$

where $r = [(n-1)/\theta]\,\theta + 1$, with $[\cdot]$ the largest integer function, and $x_{(1)} \le \cdots \le x_{(n)}$ the ordered sample. Now $\theta$ is an integer valued smoothing parameter. We applied the ML2 procedure to this estimator with the following modification. Since at least one of the quantities in (5) would be identically zero due to the finite support of $\hat f$, we modified (5) in this case so that only those x's for which $\hat f \ne 0$ entered into the product. Averaging (6) over several random splits has a smoothing effect on these estimators. An estimate averaged over five random splits appears in Figure 1. The computation time required to generate the histogram estimate was very small when compared to the kernel estimates.

A similar analysis was carried out on 100 pseudo-random samples from the standard normal density $f(x) = e^{-x^2/2}/\sqrt{2\pi}$. The sample values appear in Table 2 and the density estimates appear in Figure 3. In applying the ML1 method we used the ranges $\sigma$ and $\alpha$ = .100(.015).910 and k = 5(5)85. The ML1 method picked the constant kernel estimate with $\sigma$ = .460 but only barely so, over the case with k = 15. The constant kernel estimate is smooth (see Figure 3) and quite satisfactory while the estimate with k = 15 is very rough near the center. This pattern persisted in other examples we investigated; indeed the estimate corresponding to small k was often chosen over the constant kernel estimate. It is seen that the ML1 method, which worked well for long-tailed Cauchy data sets, has instability at small k for the short-tailed normal distribution. The error measures (14) are graphed in Figure 4. The transformation of (4) which is superimposed is 10(-log{expression (4)} - 135). The ML1 method worked very well in picking the smoothing parameter $\sigma$ of the constant kernel estimator close to the minimizing value for each error measure.

As described previously for the Cauchy data we first choose k for the ML2 method based on values (7) and several random splits. The ranges chosen were k = 5(5)45 and $\sigma$ and $\alpha$ = .100(.015).910. Instability was noted here also in the choice of k, different random splits indicating in turn the constant kernel estimator and variable kernel estimators with different k values. The ML2 method seemed to guard against the choice of an extremely small k better than the ML1 method. The use of repeated random splits, which at first glance is a drawback of the ML2 method, gave the following useful observation. For each random split the constant kernel estimate gave a value for the logarithm of (7) close to the maximum. In fact averaging the logarithms of the likelihood over four random splits showed the constant kernel estimator to be the best. Based on this the constant kernel form was chosen; then three additional random splits were used to give three estimates of the form (6) whose average appears in Figure 3. We mention in passing that the logarithms of (7) used in making the choices among variable and constant kernel forms were often very close together, differing sometimes only in the fifth significant digit. To check for round-off inaccuracies we reprogrammed all calculations in double precision but none of the selections was changed.

The  ML2  method  was  applied  to  the  histogram  family  (15).  The  average  of 
25  estimates  of  the  form  (6)  appears  in  Figure  3. 

We  checked  the  goodness-of-fit  criterion,  used  by  Breiman  et  al.  (1977), 
on  the  normal  data  set.  The  same  results  occurred  here  as  reported  for  the 
Cauchy  data  set. 

Since the problem with the estimation of long tailed densities was in the oversmoothing caused by the extreme observations, we trimmed observations and considered the natural modified empirical likelihood representing an estimate of the joint density of the order statistics $x_{(L+1)}$ through $x_{(n-L)}$. The smoothing parameter chosen initially decreased drastically with increasing L for long tailed distributions and the estimate of the Cauchy density continued to improve as L increased. However, since the maximizing $\theta_n$ was nearly decreasing as L increased we are not able to give any guidance as to how many observations to trim.

4. Multivariate Case.

In the immediate generalization of the univariate maximum likelihood procedure to the multivariate case one would need to choose a shape factor for each coordinate. For simplicity we restrict ourselves to the bivariate case where the bivariate kernel estimator based on the random sample $(x_1,y_1),\ldots,(x_n,y_n)$ from a bivariate density is

$$\hat f_n(x,y;a,b) = (nab)^{-1}\sum_{i=1}^{n} k\{(x-x_i)/a,\ (y-y_i)/b\}$$

where k is a bivariate density. A common oversimplification in case studies of applications of bivariate (and multivariate) kernel estimators has been to take the same shape factor for each coordinate. Our empirical work has been limited to the split sample ML2 method. We again assume n is even, say n = 2m. Our bivariate procedure randomly splits the bivariate data into two groups of ordered pairs, say the first and the last m pairs, which we refer to as the x's and the y's. The x's would be used in defining the functional form of the kernel estimator $\hat f_m$ of f and the y's would be used to find the shape factors $(a_1,b_1)$ which maximize the "empirical probability" of observing the y's (as in the univariate case). In the same fashion a second estimator would be constructed which uses the y's in defining the functional form of $\hat f_m$ and the x's would be used to find the shape factors $(a_2,b_2)$ which maximize the "empirical probability" of observing the x's. The resulting two estimators, say $\hat f_{m,1}(\cdot,\cdot;a_1,b_1)$ and $\hat f_{m,2}(\cdot,\cdot;a_2,b_2)$, are then averaged to obtain the bivariate estimator $\hat f_n$ of f.


of solving the bivariate maximization problem. As a good initial guess we used what we refer to as the marginal solutions. Basically, the marginal solution consists of finding the shape parameters which work "best" for each coordinate. To be more specific, if our data pairs were $(x_1,y_1),\ldots,(x_n,y_n)$, where n = 2m, we use $(x_1,\ldots,x_n)$ as in section 3 to obtain two estimators $\hat f_{m,1}(x;\hat a)$ and $\hat f_{m,2}(x;a^*)$ of the common density of $x_1,\ldots,x_n$. Similarly we use $(y_1,\ldots,y_n)$ as in section 3 to obtain two estimators $\hat h_{m,1}(y;\hat b)$ and $\hat h_{m,2}(y;b^*)$ of the common density of $y_1,\ldots,y_n$. The first approximation to $(a_1,b_1)$ was $(\hat a,\hat b)$ and the first approximation to $(a_2,b_2)$ was $(a^*,b^*)$. Of course, one could just estimate the bivariate f by averaging $\hat f_{m,1}(\cdot,\cdot;\hat a,\hat b)$ and $\hat f_{m,2}(\cdot,\cdot;a^*,b^*)$. We call this our marginal solution. Although somewhat more irregular in the bivariate normal cases we have studied, this marginal solution is less time-consuming to compute and seems to be adequate for many purposes. In figures 5-10 we picture the actual density, the marginal estimator, and the nonparametric maximum likelihood estimator for two case studies of samples of 400 from bivariate normal densities. In case 1 we are sampling from the bivariate normal density with mean vector 0 and unit covariance pictured in figure 5. Figure 7 gives the marginal estimators for this case and figure 8 gives the nonparametric maximum likelihood estimator. The second case study is a sample of 400 from the bivariate normal density with 0 mean and covariance matrix

pictured in figure 6. Figures 9 and 10 give the marginal and nonparametric maximum likelihood estimators for this case. The bivariate kernel used was a product of standard normal densities.
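The difference between the marginal solution and a full two-dimensional search can be sketched for one direction of the split (fit on one half, score on the other). The product normal kernel follows the text; the sample size, the grid, and the function names are assumptions made to keep the example small, and the data are freshly simulated rather than the paper's.

```python
import math
import random


def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def product_kernel_estimate(x, y, pairs, a, b):
    """Bivariate estimate with a product of normal kernels and shape factors (a, b)."""
    s = sum(phi((x - xi) / a) * phi((y - yi) / b) for xi, yi in pairs)
    return s / (len(pairs) * a * b)


def joint_log_likelihood(held_out, fit, a, b):
    """Log 'empirical probability' of the held-out pairs under the fitted estimate."""
    total = 0.0
    for x, y in held_out:
        dens = product_kernel_estimate(x, y, fit, a, b)
        if dens <= 0.0:
            return float("-inf")
        total += math.log(dens)
    return total


def marginal_log_likelihood(held_out, fit, theta):
    """Same criterion for a single coordinate, used for the marginal solution."""
    total = 0.0
    for x in held_out:
        dens = sum(phi((x - xi) / theta) for xi in fit) / (len(fit) * theta)
        if dens <= 0.0:
            return float("-inf")
        total += math.log(dens)
    return total


rng = random.Random(3)
pairs = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(80)]
fit, held_out = pairs[:40], pairs[40:]
grid = [0.2 * j for j in range(1, 9)]

# Marginal solution: tune each coordinate separately.
a_marg = max(grid, key=lambda t: marginal_log_likelihood(
    [p[0] for p in held_out], [p[0] for p in fit], t))
b_marg = max(grid, key=lambda t: marginal_log_likelihood(
    [p[1] for p in held_out], [p[1] for p in fit], t))

# Full search over the two-dimensional grid of shape factors.
a_joint, b_joint = max(((a, b) for a in grid for b in grid),
                       key=lambda ab: joint_log_likelihood(held_out, fit, ab[0], ab[1]))
print((a_marg, b_marg), (a_joint, b_joint))
```

The marginal search costs two one-dimensional scans instead of a scan over the whole grid, which is the saving in computing time the text refers to; the joint criterion at the marginal pair can never exceed its value at the jointly chosen pair.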




The results reported here are only a small part of our experience in applying non-parametric maximum likelihood techniques. However, our multivariate experience with simulated data has been limited to the split sample ML2 method. Below is a list of observations and recommendations.

a. We find the ML2 data-splitting method to be very attractive, both conceptually and in application. There is some instability in the choice of k for variable kernel estimators of short tailed densities but we judge it to be less than for the ML1 estimators. The answer here might be to have a preliminary classification of the density as long or short tailed and then to suitably restrict the k values considered. For short tailed densities only large k values and the constant kernel case would be considered. A drawback of the ML2 method for kernel estimation is the considerable computation time involved. The ML1 method offers only moderate improvement in computation time. We feel that the randomized nature of the ML2 method may prove very useful in future work in guarding against bad estimates both of densities and functionals of densities. An attempt was made to remove this randomized component by dividing the sample into the even and odd statistics. This approach failed; the density estimates were too rough.

b. The ML1 method has, as noted above, instability in the estimation of short tailed densities. However, in cases where this was not a problem the ML1 and ML2 methods tended to agree closely and, for constant kernel estimation, to coincide almost exactly. We view this as justification of the ML1 procedure which on the surface does not impart

c. The histogram estimators are disappointing in their lack of smoothness but within that class of estimators we judge that the maximum likelihood method worked well. Computation time was fast. Perhaps future work will develop a "quick and dirty" way of using a histogram estimate as a preliminary in choosing the smoothing parameters of kernel estimators.

d. We have determined that maximum likelihood techniques are also applicable to multidimensional density estimation. In multivariate kernel estimation with a product kernel a marginal distribution technique is to choose the smoothing parameter associated with each variable by considering only the univariate marginal distribution of each variable. This is to be contrasted with a multidimensional search of the likelihood surface. In the multivariate case computation time is very important. In this direction our empirical work was limited to several multivariate normal data sets using the ML2 constant kernel method. Although the estimators were quite reasonable there was some tendency to oversmooth.

e. The use of maximum likelihood techniques on families of estimators other than those considered here should be investigated. In particular this includes the orthogonal series estimators of Wahba (1978).

Acknowledgement. The work reported in this paper is primarily a summary of results in the papers by Schuster and Gregory (1978, 1979, 1981).


Table 1. Cauchy Data          Table 2. Normal Data


t 

-2.313 

-2.232 

-1.957 

-1.641 

-1.595 

-1  529 

-1.429 

-1.424 

* 

-23  ;06* 

-17*511* 

-11.159 

-10  315 

-1.157 

-1.145 

-1  129 

-1.065 

-3  7y.* 

-4.828* 

-3  836 

-3.319* 

-1.039 

-1  018 

-0.956 

-0.908 

* 

-2  5/6* 

-2  750* 

-2  449 

-2.417* 

-0.905 

-0  835 

-0  861 

-0.851 

363* 

-2. 31* 

-2.114* 

-2  058* 

-0  814 

-0.757 

-0.692 

-0  690 

-1  621* 

-1  543* 

-1.401 

-0.647 

-0  631 

-0.621 

-0.611 

-1  .#? 

-1  236* 

-1.178* 

-1  083 

-0.598 

-0  536 

-C. 532 

-0.502 

- 4 4* 

-1  032* 

-1.0/3 

-0  987* 

-0.491 

-0  465 

-0  447 

-0.432 

t'  * < 

-0  - 4* 

-0  756 

-0  525* 

-0  480 

-0  421 

-C  420 

-0.384 

-0.361 

' , 

-3  .62 

-0  455 

-0  454 

-0  434* 

-0.297 

-0  259 

-0.244 

-0.197 

f H 

-0  429* 

-0.421* 

-0  328 

-0  318 

-0.173 

-0  112 

-0  090 

-0.C47 

-0  2-0* 

-0  265 

-0  212* 

-0.165 

-0  031 

-0  029 

-0  015 

-0.00*4 

-0  157* 

-0  006 

0 051* 

0 056* 

-0  001 

0 014 

0.094 

0.J16 

0 063* 

0 084* 

0 116* 

0 117 

0.134 

0 146 

0.163 

0.166 

' 

0 157 

0 186* 

0 226* 

0 253 

0 167 

0 178 

0.222 

0 242 

0 266* 

0 284 

0 337 

0.34  3 

0 263 

0.278 

0 326 

0.366 

0 'arJ* 

0 381* 

0 402 

0.444 

0 380 

0 409 

C 416 

0.423 

0 ;.-* 

0 5?0 

0 60  > 
Summary: In this paper, we review some recent results for choosing smoothing parameters for some bivariate density estimators. Some univariate results are reviewed as well.
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Review  of  Some  Results  in  Bivariate  Density  Estimation 


1.  Introduction 

For representing and examining data in up to several dimensions,
nonparametric density estimation provides an analytic tool that is
simultaneously exploratory and confirmatory. Unusual data features that
may be discovered or explored include multiple modes or clusters, as
well as unusual isolated points. At the same time, nonparametric density
estimators are confirmatory since they provide a consistent estimator of
the true underlying sample density function under mild restrictions.

The family of nonparametric density estimators is diverse, including
the histogram, frequency polygon, kernel estimator, series estimator, and
penalized-likelihood estimators, to name a few important choices. Each of
these methods has one or more calibration or design parameters commonly
referred to as smoothing parameters. The bin width for an equally-spaced
histogram plays the role of the smoothing parameter; too wide a bin width
gives an oversmoothed estimate while too narrow a bin width results in an
undersmoothed or rough-looking figure. In the terminology of Tukey's
exploratory data analysis, in the first case we see too much of the
forest (i.e., the smooth) and in the latter case we see too many trees in
the forest (too rough).

Much  theoretical  and  some  practical  work  has  appeared  on  how  to 
choose  the  smoothing  parameter  to  provide  the  best  approximation  to  the 
underlying  density  function.  It  is  also  the  case  that  the  smoothing 
parameter  has  a certain  exploratory  nature,  where  we  dynamically  adjust 
how much forest and how many trees we wish to see.
2. Univariate Density Estimation

2.1  Some  Graphical  Interactive  Approaches 

In recent years there have appeared some interesting algorithms that
automatically pick a smoothing parameter appropriate for a given data set
for a particular nonparametric density estimation procedure. Prior to
the evolution of these algorithms, statisticians learned how to pick good
smoothing parameters through simulation experiments and interactive
graphical methods. These latter methods will remain important even
with the automatic methods, for validation purposes, for data exploration,
and in cases where the automatic methods occasionally return
bad smoothing parameter values.

We can illustrate several graphical methods, some known, some not,
with a nonparametric kernel density estimator

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right). \qquad (1)$$

The  first  method  is  to  pick  a decreasing  sequence  of  smoothing  parameters 
(h's)  and  look  at  the  corresponding  sequence  of  density  estimates.  For 
some  simulated  Gaussian  mixture  data  with  n = 300,  we  show  a sequence  of 
estimates  in  Figures  1(A)-(C).  It  is  important  to  start  with  obviously 
oversmoothed  estimates  (large  h)  and  look  at  the  resulting  sequence  of 
estimates  that  shows  increasing  fidelity  to  the  data  and  then  finally 
becomes  contaminated  with  noisy  fluctuations. 
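
The sequential approach can be sketched in a few lines. This is only an illustration under stated assumptions: the Gaussian kernel, the particular two-component mixture, the bandwidth values, and the helper names `gaussian_kde` and `count_modes` are all hypothetical choices, not taken from the figures.

```python
import numpy as np

def gaussian_kde(x_grid, data, h):
    """Gaussian kernel density estimate on x_grid with smoothing parameter h."""
    u = (x_grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))

def count_modes(y):
    """Number of interior local maxima of a curve sampled on a grid."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] >= y[2:])))

rng = np.random.default_rng(0)
n = 300
# Hypothetical Gaussian mixture; the mixture behind Figure 1 is not specified.
from_first = rng.random(n) < 0.5
data = np.where(from_first, rng.normal(-1.5, 1.0, n), rng.normal(1.5, 0.6, n))

x_grid = np.linspace(-10.0, 10.0, 2001)
bandwidths = [1.5, 0.8, 0.4, 0.2, 0.1]  # start oversmoothed, then decrease h

estimates = {h: gaussian_kde(x_grid, data, h) for h in bandwidths}
mode_counts = {h: count_modes(estimates[h]) for h in bandwidths}

for h in bandwidths:
    print(f"h = {h:4.2f}: {mode_counts[h]} local maxima")
```

Plotting each entry of `estimates` in turn reproduces the interactive sequence; the count of local maxima is printed only as a crude numerical stand-in for the visual impression of roughness.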

The preceding interactive approach is not very sensitive for
discriminating among several apparently "good" estimate-pictures. A
similar problem exists in curve fitting. Tukey points out that plots of
the residuals = data - fit provide a greatly enhanced ability to compare
fits. The corresponding modification here is to examine plots of $\hat{f}''$ rather than of $\hat{f}$ itself.

The  second  derivative  is  much  more  sensitive  to  small  changes  in  h than 
the  density  function  itself.  This  is  the  procedure  advocated  by 
Silverman  (1978),  which  results  in  pictures  he  calls  "test  graphs." 

With a little more experience and experimentation, we can go through a
sequence of test graphs and accept a test graph with a desired amount
of noisy fluctuations. In Figure 2, we reproduce three test graphs
presented by Silverman.

A third  procedure  provides  a useful  shortcut  and  sometimes  welcome 
relief  to  the  previous  methods.  A possible  choice  of  a measure  of  the 
roughness  contained  in  a univariate  test  graph  for  a Gaussian  kernel  is 

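$$\int \hat{f}''(x)^2\,dx = \frac{1}{8 n^2 h^5 \sqrt{\pi}} \sum_i \sum_j \left[ 3 - 3\,\frac{(x_i - x_j)^2}{h^2} + \frac{(x_i - x_j)^4}{4 h^4} \right] e^{-(x_i - x_j)^2 / (4h^2)}. \qquad (2)$$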

We simply plot the logarithm of equation (2) as a function of h. The
graph has slope near zero for values of h corresponding to moderate
oversmoothing and very large slope for values of h corresponding to rough
estimates approaching Dirac spikes at the sample points. In Figure 3 we
show six examples for various simulated data sets of this so-called "h-rough"
plot. Also shown in each h-rough graph is a point labelled "best h."

This is the particular choice of h for that sample that minimized the
integrated squared error (ISE) between the sampling density $f$ and the
estimate $\hat{f}$, which is given by

$$\mathrm{ISE} = \int \left( \hat{f}(x) - f(x) \right)^2 dx.$$
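
As a hedged sketch of the h-rough computation, the code below evaluates the closed-form roughness of a Gaussian-kernel estimate and checks it against direct numerical integration of the squared second derivative. The function names, the simulated sample, and the bandwidth grid are illustrative assumptions.

```python
import numpy as np

def roughness(data, h):
    """Closed-form integral of the squared second derivative of a Gaussian KDE."""
    n = len(data)
    u2 = ((data[:, None] - data[None, :]) / h) ** 2
    terms = (3.0 - 3.0 * u2 + 0.25 * u2**2) * np.exp(-0.25 * u2)
    return terms.sum() / (8.0 * n**2 * h**5 * np.sqrt(np.pi))

def roughness_numeric(data, h, grid_size=4001):
    """Same quantity by a Riemann sum, used only as a cross-check."""
    x = np.linspace(data.min() - 8.0 * h, data.max() + 8.0 * h, grid_size)
    u = (x[:, None] - data[None, :]) / h
    second = ((u**2 - 1.0) * np.exp(-0.5 * u**2)).sum(axis=1)
    second /= len(data) * h**3 * np.sqrt(2.0 * np.pi)
    return np.sum(second**2) * (x[1] - x[0])

rng = np.random.default_rng(1)
sample = rng.normal(size=100)  # hypothetical data set

# The "h-rough" curve: log roughness as a function of h.
bandwidths = np.geomspace(0.05, 2.0, 30)
log_rough = np.log([roughness(sample, h) for h in bandwidths])

print(roughness(sample, 0.4), roughness_numeric(sample, 0.4))
```

Plotting `log_rough` against `bandwidths` gives the h-rough plot; the curve decreases steadily in h, flattening where the estimate is oversmoothed.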

It  is  clear  that  good  choices  of  h lie  in  the  region  where  the  slope  of 
Other useful approaches are based on rules of thumb derived from
asymptotic theoretical results. For example, Scott (1979) proved that
the optimal bin width h for an equally spaced histogram density estimator
is given by

$$h^* = \left[ \frac{6}{\int f'(x)^2\,dx} \right]^{1/3} n^{-1/3}. \qquad (3)$$

The rule of thumb he proposed was to choose

$$\hat{h} = 3.5\, s\, n^{-1/3}, \qquad (4)$$

where $s$ is the sample standard deviation,

a formula based on using equation (3) and data moment estimators,
assuming the sampling data is $N(\mu, \sigma^2)$. He also provided multiplicative
correction factors based on higher order sample moments such as the
skewness. In Figure 4, we show 3 histograms of the same simulated Normal
data with n = 1000. These figures also illustrate the usefulness of the
integrated squared error criterion upon which equation (3) is based.

Also  notice  how  the  sequential  interactive  approach  works  well  here. 
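
A minimal sketch of the rule of thumb in equation (4), applied to simulated Normal data with n = 1000. The helper names `scott_bin_width` and `histogram_density` and the choice of bin origin at the sample minimum are assumptions made for illustration.

```python
import numpy as np

def scott_bin_width(data):
    """Rule-of-thumb histogram bin width 3.5 * s * n^(-1/3)."""
    s = np.std(data, ddof=1)
    return 3.5 * s * len(data) ** (-1.0 / 3.0)

def histogram_density(data, h):
    """Equally spaced histogram density estimate with bin width h."""
    lo = data.min()
    num_bins = int(np.ceil((data.max() - lo) / h))
    edges = lo + h * np.arange(num_bins + 1)
    heights, edges = np.histogram(data, bins=edges, density=True)
    return heights, edges

rng = np.random.default_rng(2)
data = rng.normal(size=1000)

h_rule = scott_bin_width(data)
heights, edges = histogram_density(data, h_rule)
print(f"bin width {h_rule:.3f}, {len(heights)} bins")
```

Repeating the last two lines with `2 * h_rule` and `h_rule / 2` gives an oversmoothed and an undersmoothed histogram for comparison.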

One automatic method for picking a kernel smoothing parameter is
called the "quasi-optimal" procedure (Scott, 1976). It is based upon the
well-known theoretically optimal choice for h,

$$h = h^* = \beta(K) \left[ n \int f''(x)^2\,dx \right]^{-1/5}, \qquad (5)$$

where the constant $\beta(K) = \left[ \int K(t)^2\,dt \,/\, \sigma_K^4 \right]^{1/5}$ depends only on the kernel.

For a particular choice of h, we have a ready estimate for the right-hand
side of equation (5) using equation (2) for a Gaussian kernel. The
quasi-optimal smoothing parameter is the largest stationary point of
equation (5), found by plotting the right- and left-hand sides of
equation (5) as a function of h.

Stationary points are marked by arrows and occur where the lines intersect.

This  and  several  other  automatic  procedures  have  recently  been  compared 
by  Monte  Carlo  methods  (Scott  and  Factor,  1981). 

2.2  Other  Univariate  Procedures 

A new density estimator was proposed (Scott, Tapia & Thompson, 1979)
based on the maximum penalized-likelihood criterion:

$$\ln L(f) = \sum_i \ln[f(x_i)] - \alpha \int f''(t)^2\,dt. \qquad (6)$$

If we optimize (6) over the class of continuous piecewise-linear
functions defined on a given mesh we obtain the DMPLE, the discrete
maximum penalized-likelihood estimator, a code for which exists in the IMSL
library (1982, NDMPLE). Here $\alpha$, the penalty or roughness weight, plays
the role of the smoothing parameter. While consistency of the DMPLE is
well known, we have few theoretical results on actual convergence behavior.
Extensive numerical simulations indicate that the rate of convergence is
$n^{-4/5}$, the same as for many other techniques (except, for example, for
the histogram, which is $n^{-2/3}$). However, these same simulations indicate
that the DMPLE is very efficient for the sampling densities examined.
In Table I, we examine sample sizes required to achieve an average integrated
squared error of 1/400 for Gaussian sample data N(0,1). A complete
picture of the general behavior of the DMPLE for various penalty functions
and sampling densities is an open area of research.
[...] by line segments, it is
infrequently used. Note the DMPLE has the same form, if not origin, as the
frequency polygon. However, I have recently shown that the frequency
polygon properly constructed actually shares the same approximation properties
as the kernel methods rather than the poorer properties of the histogram;
that is, it converges at the rate $n^{-4/5}$; see Scott (1982). This
observation was recently made independently by David Freedman at SLAC.
However, this is a whole paper in itself. But notice how the frequency
polygon behaves in Table I. This is generally the case.
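
For readers unfamiliar with the estimator, a frequency polygon can be sketched as linear interpolation between histogram bin midpoints, with one empty bin added at each end so that the polygon still integrates to one. This is the textbook construction, offered as an assumption about what "properly constructed" involves; the bin width and the function name are hypothetical.

```python
import numpy as np

def frequency_polygon(data, h):
    """Return bin midpoints and heights defining a frequency polygon."""
    lo = data.min()
    k = int(np.floor((data.max() - lo) / h)) + 1  # bins covering the data
    edges = lo + h * np.arange(-1, k + 2)         # plus one empty bin per side
    heights, _ = np.histogram(data, bins=edges, density=True)
    mids = edges[:-1] + 0.5 * h
    return mids, heights

rng = np.random.default_rng(4)
data = rng.normal(size=500)
mids, heights = frequency_polygon(data, h=0.5)

# Evaluate the polygon on a fine grid by linear interpolation.
x = np.linspace(mids[0], mids[-1], 20001)
polygon = np.interp(x, mids, heights, left=0.0, right=0.0)
print(f"polygon integrates to about {polygon.sum() * (x[1] - x[0]):.4f}")
```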

With  the  above  as  background,  we  can  look  more  closely  at  some 

corresponding  two-dimensional  results. 
symmetry condition: $K_0(x,y) = K_0(-x,-y)$.

Cacoullos (1964) examined this general case and, in fact, proposed
the simpler product kernel

$$K_0\!\left(\frac{x_i - x}{h_x}, \frac{y_i - y}{h_y}\right) = K\!\left(\frac{x_i - x}{h_x}\right) K\!\left(\frac{y_i - y}{h_y}\right).$$

This form has certain computational advantages, especially when the
univariate kernel has finite support. Cacoullos wrongly proves in
his last theorem that optimally for product kernels we should restrict
ourselves to

$$h_x = h_y. \qquad (10)$$


Then in Table II we look at the inefficiency with respect to average ISE
for the restricted optimization problem satisfying equation (10) versus
the unrestricted optimization problem. The results given in the Table
emphasize how large this inefficiency can be. The obvious fix is to
standardize the data so $s_x = s_y$. However, the behavior shown in Table II
is the result of complicated functions of second order derivatives of f and
not simply functions of the moments, in general.
For other bivariate methods, the above results emphasize the need for
having at least one smoothing parameter for each variable.
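
A product-kernel estimate with a separate smoothing parameter for each variable might be sketched as below. The Gaussian univariate kernel, the simulated data with very different scales, and the bandwidth choice $s\,n^{-1/6}$ are illustrative assumptions, as is the function name.

```python
import numpy as np

def product_kernel_kde(points, x, y, hx, hy):
    """Bivariate product-kernel estimate on the grid x by y.

    points has shape (n, 2); the result has shape (len(x), len(y)).
    """
    n = points.shape[0]
    ux = (x[:, None] - points[None, :, 0]) / hx
    uy = (y[:, None] - points[None, :, 1]) / hy
    kx = np.exp(-0.5 * ux**2) / np.sqrt(2.0 * np.pi)
    ky = np.exp(-0.5 * uy**2) / np.sqrt(2.0 * np.pi)
    return kx @ ky.T / (n * hx * hy)

rng = np.random.default_rng(5)
n = 200
points = np.column_stack([rng.normal(0.0, 1.0, n), rng.normal(0.0, 5.0, n)])

# One smoothing parameter per variable, scaled by each sample standard deviation.
sx, sy = points.std(axis=0, ddof=1)
hx, hy = sx * n ** (-1.0 / 6.0), sy * n ** (-1.0 / 6.0)

x = np.linspace(-8.0, 8.0, 161)
y = np.linspace(-40.0, 40.0, 161)
estimate = product_kernel_kde(points, x, y, hx, hy)
total = estimate.sum() * (x[1] - x[0]) * (y[1] - y[0])
print(f"hx = {hx:.2f}, hy = {hy:.2f}, total mass = {total:.4f}")
```

Forcing `hx == hy` on these data would oversmooth one coordinate or undersmooth the other, which is the inefficiency the restricted problem exhibits.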


3.2  Optimal  Kernels 


For univariate kernel estimation, Epanechnikov (1969) proved that
the optimal kernel was of the form

$$K^*(x) = \tfrac{3}{4}\,(1 - x^2), \qquad -1 < x < 1.$$

Nezames has proven the following:

Theorem: The optimal bivariate kernel $K_0$ is given by

$$K_0(x,y) = \frac{1}{3\pi}\left[ 1 - \tfrac{1}{6}(x^2 + y^2) \right], \qquad x^2 + y^2 < 6. \qquad (13)$$

This kernel looks like the univariate kernel swept 360° about the z-axis. The increase
in efficiency of the optimal kernel (13) compared to other kernels is not
large, a situation similar to the one-dimensional case investigated by
Epanechnikov; see Table III. Notice how the product kernels are only
slightly inefficient. The Gaussian product kernel is perhaps surprisingly
inefficient.
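
As a quick numerical sanity check, the sketch below assumes the bivariate kernel has the form $\frac{1}{3\pi}[1 - (x^2+y^2)/6]$ on the disk $x^2 + y^2 < 6$ (the constants are a reading of a damaged formula, so treat them as an assumption) and verifies that this form integrates to one and has unit variance in each coordinate.

```python
import numpy as np

def optimal_bivariate_kernel(x, y):
    """Assumed form of the radially symmetric optimal kernel on x^2 + y^2 < 6."""
    r2 = x**2 + y**2
    return np.where(r2 < 6.0, (1.0 - r2 / 6.0) / (3.0 * np.pi), 0.0)

# Midpoint rule on a square that contains the support.
m = 1200
edges = np.linspace(-np.sqrt(6.0), np.sqrt(6.0), m + 1)
mid = 0.5 * (edges[:-1] + edges[1:])
cell = (edges[1] - edges[0]) ** 2
X, Y = np.meshgrid(mid, mid, indexing="ij")
K = optimal_bivariate_kernel(X, Y)

mass = K.sum() * cell
var_x = (X**2 * K).sum() * cell
var_y = (Y**2 * K).sum() * cell
print(f"mass = {mass:.4f}, var_x = {var_x:.4f}, var_y = {var_y:.4f}")
```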


3.3 Picking the Smoothing Parameters $h_x$ and $h_y$

All of the one-dimensional methods described in section 2.1 may be
directly extended to the 2-dimensional case. Direct sequential bivariate
iterations are much more time consuming and difficult to perform repeatedly.
The test graph approach is less easy to visualize than the density estimate
(using contours, say) because the test graph will have contours corresponding
to negative values. However, the test graph is still a more sensitive
instrument than the direct interactive approach.

There also exists a bivariate quasi-optimal algorithm implementation
that has been evaluated in some simple cases by Nezames.

There are also bivariate extensions of rules of thumb based on
theoretical results. For example, for bivariate histograms with
rectangular bins of size $h_x$ by $h_y$, the data-based rules are

$$\hat{h}_x = 3.5\, s_x\, n^{-1/4}, \qquad \hat{h}_y = 3.5\, s_y\, n^{-1/4}, \qquad (14)$$

expressions virtually the same as the one-dimensional result in
equation (4), except for the exponent on n. The first correction to
equation (14) is based on the sample correlation coefficient r.
Equation  (14)  should  be  divided  by 


( 1 - r ) 


Higher  order  moment  corrections  could  be  developed. 

We  next  consider  the  smoothing  of  a bivariate  series  estimator  using 
the  cross-validation  algorithm  of  Wahba  (1981).  The  smoothing  parameters 
used  minimize  a certain  generalized  cross-validation  functional. 

Depending upon the exact form of the initial series estimator, you get
either the algorithm given by Wahba (1981) or a slightly different version
developed by Nezames (1980). First, for n = 50 and $\rho$ = .80 with bivariate
Gaussian sample data, we show contours of the cross-validation functions for
the two approaches; see Figures 6 and 7. The two corresponding estimates
are shown in Figures 8 and 9.

The edge effects of series estimators are well known to exist but
are always somewhat surprising to see. In Figure 10 we show an
estimate for n = 100, $\rho$ = 0, bivariate Gaussian data. Notice how the
periodic nature of the solution is clear.

3.4  Rates  of  Convergence 

In Table IV, we summarize the rates of convergence of the various
density  estimators.  The  frequency  polygon  again  performs  very  well. 
Notice  that  the  two-dimensional  kernel  methods  have  the  same  convergence 
rate  as  the  one-dimensional  histogram.  As  an  aside,  the  bivariate 
frequency  polygon  may  best  be  constructed  using  histograms  with  base 
bins  in  the  shape  of  hexagons;  that  is,  a shape  capable  of  tiling 
the  plane  and  approximating  a circle. 

3.5  Bivariate  DMPLE 

Nezames has implemented the Bivariate DMPLE for the class of
piecewise constant functions. As an example, for an n = 200, $\rho$ = 0 bivariate
Gaussian sample, the histogram is shown in Figure 10 and the
corresponding DMPLE in Figure 11. Notice the reduction in noise and
false peaks in Figure 11.

3.6 Scatter Diagrams or Density Estimate Contours?

One  thing  statisticians  are  supposed  to  do  well  is  examine  scatter 
diagrams,  such  as  those  from  residual  plots.  It  has  been  my  experience 
that  the  naked  scatter  diagram  is  a difficult  object  to  "see." 


For example,
consider the blood lipid (fat) data shown in Figure 12 (Scott, et al, 1978).
These  data  represent  the  cholesterol  and  triglyceride  values  of  320 
males  with  angiographically  demonstrated  coronary  artery  disease.  Now  look 
at  the  contours  of  a kernel  estimate  of  the  same  data  shown  in  Figure  13. 

The  bimodal  feature  was  an  important  undiscovered  feature  in 
previous  analyses  of  these  data. 
1. INTRODUCTION


An important question in cluster analysis and pattern recognition
is the determination of the number of clusters into which a given population
should be divided. Frequently, particularly when certain specific clustering
methods are being used, the number of clusters is taken to be equal to the
number of modes, or local maxima, in the probability density function
underlying the given data set; in some applications this question is of
direct interest in its own right.

Investigation of the number of local maxima in a density or its derivative
has been considered by several authors, for example Cox (1966) and Good and
Gaskins (1980). Most methods seem to depend on some arbitrary implicit or
explicit choice of the scale of the effects being studied; see the remarks
of Silverman (1980). The simple approach based on kernel density estimates
described in this paper has the virtue of making this choice in an automatic
and natural way.

The use of kernel density estimates in mode estimation was originated
by Parzen (1962). The 'gradient method' of cluster analysis is based on
clustering towards modes in the estimated density; see, for example, Andrews
(1972), Fukunaga and Hostetler (1975), and Hock (1977).

In Section 2 below the test statistic to be used is defined, and a
bootstrap technique for assessing significance is given in Section 3. An
illustrative application is given in Section 4. In Sections 5 and 6 the
asymptotic behaviour of the test statistic is discussed.

For a published version of this work, see Silverman (1981a) and
Silverman (1983).
2. The Critical Window Width

A possible test statistic for hypotheses concerning the number of modes in the density can be
obtained by constructing kernel density estimates of the data. The kernel density estimate
(Rosenblatt, 1956) for window width $h$ based on univariate observations $X_1, \ldots, X_n$ is defined by

$$\hat{f}(t; h) = n^{-1} h^{-1} \sum_{i=1}^{n} K\{h^{-1}(t - X_i)\}, \qquad (1)$$

where K is a kernel function, which we shall assume throughout to be the normal density
function. Apart from the theoretical advantages of this choice, the use of a normal kernel has
strong computational advantages; see Silverman (1981b).

The window width h controls the amount by which the data are smoothed to obtain the
kernel estimate. Thus, for example, if the data are strongly bimodal a large value of h will be
needed to obtain a unimodal estimate. Suppose that we wish to test the null hypothesis that the
density $f$ underlying the data has k modes, against the alternative that $f$ has more than k modes;
often k = 1. Define the k-critical window width $h_{crit}$ by

$$h_{crit} = \inf\{h;\ \hat{f}(\cdot; h) \text{ has at most } k \text{ modes}\}. \qquad (2)$$

Large values of $h_{crit}$ will reject the null hypothesis. Silverman (1978c) used a critical value of a
smoothing parameter in a somewhat different context. The computation of $h_{crit}$ in practice is
facilitated by the following theorem and corollary.

Theorem. Given any fixed $X_1, \ldots, X_n$, define $\hat{f}(t, h)$ as in (1) above, using a normal kernel K.
For each integer $m \ge 0$, the number of maxima as t varies in $\partial^m \hat{f}(t, h) / \partial t^m$ is a right continuous
decreasing function of h.

The following corollary follows at once.

Corollary. Defining $h_{crit}$ as in (2) above, $\hat{f}(\cdot; h)$ has more than k modes if and only if $h < h_{crit}$.

The corollary shows that $h_{crit}$ can be found by a binary search procedure, since for any value
of $h$ we can tell at once whether or not $h < h_{crit}$ by counting the number of modes in $\hat{f}(\cdot; h)$. The
result is also used in Section 3 below.
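
The binary search suggested by the corollary might be sketched as follows. The grid-based mode counter, the tolerance, the starting bracket, and the simulated two-cluster sample are all illustrative assumptions; modes are counted on a finite grid, so the result is only as fine as that grid allows.

```python
import numpy as np

def count_modes(data, h, grid_size=4000):
    """Count local maxima of the normal-kernel estimate on a fine grid."""
    t = np.linspace(data.min() - h, data.max() + h, grid_size)
    u = (t[:, None] - data[None, :]) / h
    y = np.exp(-0.5 * u**2).sum(axis=1)  # constant factors do not move the modes
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] >= y[2:])))

def critical_width(data, k, tol=1e-3):
    """Binary search for the smallest h giving at most k modes."""
    lo = 0.0                        # treated as having more than k modes
    hi = data.max() - data.min()    # starting guess for an h with at most k modes
    while count_modes(data, hi) > k:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_modes(data, mid) > k:
            lo = mid                # mid is below the critical width
        else:
            hi = mid                # mid is at or above the critical width
    return hi

rng = np.random.default_rng(6)
data = np.concatenate([rng.normal(-3.0, 1.0, 50), rng.normal(3.0, 1.0, 50)])

h1 = critical_width(data, k=1)
h2 = critical_width(data, k=2)
print(f"h_crit(1) = {h1:.3f}, h_crit(2) = {h2:.3f}")
```

The search is valid only because the mode count is monotone in h for the normal kernel; with other kernels the bracket logic could fail.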

This section is concluded with the proof of the theorem, which makes use of the theory of
total positivity; see, for example, Karlin (1968). Let $v_{m+1}(h)$ denote the number of sign changes in
$\hat{f}^{(m+1)}(\cdot, h)$. Since $(-t)^{m+1} \hat{f}^{(m+1)}(t, h)$ is, for all $m \ge 0$ and $h$, eventually positive as $t \to -\infty$ and as
$t \to \infty$, it suffices to show that $v_{m+1}$ is decreasing and right continuous. For $h_2 > h_1 > 0$,
$\hat{f}^{(m+1)}(\cdot, h_2)$ is the convolution of $\hat{f}^{(m+1)}(\cdot, h_1)$ with a $N(0, h_2^2 - h_1^2)$ density, and $\hat{f}^{(m+1)}(\cdot, h_1)$ is
continuous and bounded. Thus, by Theorem 2 of Schoenberg (1950), $v_{m+1}(h_2) \le v_{m+1}(h_1)$, so that
$v_{m+1}$ is decreasing. Now suppose, for given $h_0 > 0$, there exist $a_1 < b_1 < a_2 < \ldots < a_r < b_r$ such that
$\hat{f}^{(m+1)}(a_i, h_0) > 0$ and $\hat{f}^{(m+1)}(b_i, h_0) < 0$ for all $i$. By the continuity of $\hat{f}^{(m+1)}(t, \cdot)$, for all
sufficiently small $\epsilon$ and all $i$, $\hat{f}^{(m+1)}(a_i, h_0 + \epsilon) > 0$ and $\hat{f}^{(m+1)}(b_i, h_0 + \epsilon) < 0$. Hence
$\liminf_{h \downarrow h_0} v_{m+1}(h) \ge v_{m+1}(h_0)$; the right continuity of $v_{m+1}$ follows from the fact that $v_{m+1}$ is known
to be decreasing.

Note that Schoenberg's theorem does not apply for general kernels. Indeed, the convolution
of unimodal densities need not be unimodal; see Feller (1966, p. 164).
3. Assessing Significance

For any particular k-modal simple null hypothesis, it is easy to assess, by simulation, the
significance of the value of the critical window width obtained from the data. Suppose the null
hypothesis is that the true density is $g$ and that the value of $h_{crit}$ obtained from the data is $h_0$.
Then the theory of Section 2 implies that

$$P_g(h_{crit} > h_0) = P\{\hat{f}(\cdot; h_0) \text{ has more than } k \text{ modes} \mid X_1, \ldots, X_n \text{ is drawn from } g\}.$$

Thus, in order to assess the significance of $h_0$ for sample size n, it is only necessary to calculate
the single density estimate $\hat{f}(\cdot; h_0)$ for each sample of size n generated from $g$; there is no need to
find $h_{crit}$ for each replication.

The hypothesis that the true density is at most k-modal is of course a compound hypothesis.
To provide a conservative assessment of the significance of $h_0$, an appealing choice of the
representative $g_0$ from which to simulate is obtained by rescaling $\hat{f}(\cdot, h_0)$, as constructed from
the data, to have variance equal to the sample variance. The theory of Section 2 shows that $g_0$ is
indeed at most k-modal; it is, in a sense, the most extreme k-modal density consistent with the
data. It is extremely easy to simulate from $g_0$; Efron (1979) pointed out that independent
observations $y_i$ from $g_0$ are given by

$$y_i = (1 + h_0^2 / \sigma^2)^{-1/2} (X_{I(i)} + h_0 \epsilon_i),$$

where $X_{I(i)}$ are sampled uniformly, with replacement, from the data $X_1, \ldots, X_n$, $\sigma^2$ is the sample
variance of the data, and $\epsilon_i$ is an independent sequence of standard normal random variables.

Simulating from $g_0$ to assess significance is an example of a smoothed bootstrap procedure
as defined by Efron (1979). However, Efron's procedure contains an implicit arbitrary choice of
smoothing parameter, since the corresponding variance parameter in his formulation is essentially
arbitrary. In our case, the amount of smoothing
is automatically determined in a natural way.
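
The simulation step can be sketched as below: draw smoothed bootstrap samples using the rescaling formula given above, and estimate the p-value as the proportion of replications whose estimate at window width $h_0$ has more than k modes. The function names, the grid-based mode counter, the replication count, and the test data are illustrative assumptions.

```python
import numpy as np

def count_modes(sample, h, grid_size=1000):
    """Count local maxima of the normal-kernel estimate on a grid."""
    t = np.linspace(sample.min() - h, sample.max() + h, grid_size)
    u = (t[:, None] - sample[None, :]) / h
    y = np.exp(-0.5 * u**2).sum(axis=1)
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] >= y[2:])))

def smoothed_bootstrap_sample(data, h0, rng):
    """One sample of size n from the rescaled estimate with window width h0."""
    n = len(data)
    resampled = rng.choice(data, size=n, replace=True)
    eps = rng.standard_normal(n)
    sigma2 = np.var(data, ddof=1)
    return (resampled + h0 * eps) / np.sqrt(1.0 + h0**2 / sigma2)

def bootstrap_p_value(data, h0, k, replications=100, seed=0):
    """Proportion of bootstrap samples whose estimate at h0 has more than k modes."""
    rng = np.random.default_rng(seed)
    exceed = 0
    for _ in range(replications):
        y = smoothed_bootstrap_sample(data, h0, rng)
        if count_modes(y, h0) > k:
            exceed += 1
    return exceed / replications

rng = np.random.default_rng(7)
data = np.concatenate([rng.normal(-3.0, 1.0, 50), rng.normal(3.0, 1.0, 50)])

p_small = bootstrap_p_value(data, h0=0.3, k=1)
p_large = bootstrap_p_value(data, h0=10.0 * data.std(ddof=1), k=1)
print(p_small, p_large)
```

In practice `h0` would be the critical window width computed from the data for the chosen k; the two fixed values here only show the extremes of the estimated probability.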

Finally,  it  should  be  pointed  out  that  the  theory  and  procedure  of  finding  a critical  window 
width  and  simulating  from  a rescaled  density  estimate  constructed  using  this  window  width 
carries  over  immediately,  mutatis  mutandis,  to  the  investigation  of  maxima  in  the  first  or  higher 
derivative  of  the  data.  Both  Cox  (1966)  and  Good  and  Gaskins  (1980)  show  a preference  for 
seeking  maxima  in  the  density  derivative. 
4. An Application

We  illustrate  the  method  by  analysing  a small  data  set  of  observations  on  chondrite  meteors. 
These  data  consist  of  22  observations  which  are  given  in  Table  2 of  Good  and  Gaskins  (1980). 


Table 1

Chondrite data: critical window widths and their
estimated significance levels

Number of modes    Critical window width    p
      1                    2.39            0.08
      2                    1.83            0.05
      3                    0.68            0.79
      4                    0.47            0.93

The data have been considered by several authors; see Good and Gaskins (1980) for details. In
this analysis the raw values of the observations were used. Table 1 gives critical window widths
and significance levels for tests of the null hypothesis that the underlying density has at most k
modes against the alternative that it has more than k modes. The p-values are computed by
simulating from a critical density as described above; 100 replications of 22 observations were
used in each case.

These results must of course be interpreted as a hierarchical set of significance tests. All other
things being equal, considerations of parsimony perhaps suggest that we should test
successively for an increasing number of modes until we find a number that is accepted.
Particularly bearing in mind the small sample size, the results clearly indicate the trimodal
nature of the population; Good and Gaskins (1980) also arrived at this conclusion.
5. ASYMPTOTIC BEHAVIOUR OF THE CRITICAL WINDOW WIDTH: INTRODUCTION


In Section 2 above it was stated heuristically that large values
of $h_{crit}$ will tend to reject the null hypothesis. The results of this
section show that this procedure does indeed lead to a consistent test.

Subject to certain regularity conditions, it is shown that, under
the null hypothesis, $h_{crit}$ converges stochastically to zero, while
this is not the case under the alternative hypothesis. The exact rate
of convergence of $h_{crit}$ to zero under the null hypothesis is found.

It is perhaps interesting that this rate of convergence has precisely
the same order as the rate of convergence for the optimum choice of
window width for the uniform estimation of the density given, for
example, by Silverman (1978b).

In the smoothed bootstrap procedure given in Section 3, the
representative of the null hypothesis constructed from the data is
obtained from the density estimate with window width $h_{crit}$; the
estimate is rescaled, as suggested by Efron (1979), to have variance
equal to the sample variance of the data. The remarks above show that
$f_n(\cdot, h_{crit})$ is, in a certain sense, optimally uniformly consistent as an
estimate of the true density f. It follows that, on the null hypothesis,
the bootstrap procedure is likely, at least for large samples, to provide
an estimate of the true underlying density which is accurate in the uniform
norm. A possible drawback for small samples is the fact that the implied
constant in the rate of convergence does not necessarily take its optimum
value.

An interesting open question raised by this discussion is the possibility
of using $h_{crit}(k)$ for some value of k in developing an automatic method
for choosing the smoothing parameter in density estimation. Boneva, Kendall
and Stefanov (1971) suggested choosing the window width where 'rabbits' or
rapid fluctuations just started to appear. Such a window width would
perhaps correspond to $h_{crit}(k)$ for some $k \ge j$; since $h_{crit}(k)$ converges
to zero at the optimum rate for all $k \ge j$, a suitable formalization of the
Boneva-Kendall-Stefanov procedure would give estimates which converged at
the optimal rate, though not necessarily with the optimal constant multiplier.

The fact that $h_{crit}(k)$ has the same rate of convergence for all $k \ge j$
provides some explanation for the observation made by Boneva, Kendall and
Stefanov that the estimate seems suddenly to become noisy as the window
width is reduced.
6. ASYMPTOTIC RESULTS


In this section, the main results on the asymptotic behaviour
of $h_{crit}$ are stated and proved. It is convenient to use the convention
throughout that all limits and implied limits are taken as n tends to
infinity. Varying conventions will apply to unqualified suprema and infima
in Propositions 1 and 2 below, and these will be introduced where necessary.
The notations p lim inf and p lim sup will be used to signify the
corresponding limits in probability as n tends to infinity, and $O_p$ and
$o_p$ will denote probability orders of magnitude.

Define, for h > 0,

$$\alpha(h) = h^{-5} \log(h^{-1}). \qquad (1)$$

The main results are all contained in the following theorem.

Theorem

Suppose f is a bounded density with bounded support [a,b], and
suppose that the following conditions are satisfied:

(i) f is twice continuously differentiable on (a,b);

(ii) f has exactly j local maxima on (a,b);

(iii) $f'(a+) > 0$, $f'(b-) < 0$;

(iv) $\min_{\{z:\, f'(z) = 0\}} \dfrac{f''(z)^2}{f(z)} = c_0 > 0$.

Let $h_{crit}(k)$ be the k-critical window width constructed from an i.i.d.
sample of size n from f. Then, if $k \ge j$, defining $\alpha$ as in (1) above,

$$\text{p lim inf}\; n^{-1} \alpha\{h_{crit}(k)\} \ge \tfrac{4}{3} \pi^{1/2} c_0 \qquad (2)$$

and

$$\text{p lim sup}\; n^{-1} \alpha\{h_{crit}(k)\} < \infty, \qquad (3)$$

while if $k < j$ then there exists a constant $h_0(f,k)$ such that

$$P(h_{crit}(k) > h_0) \to 1. \qquad (4)$$

Note that condition (iv) is equivalent, in the presence of the other
conditions, to the condition that f is strictly positive on (a,b) and
f' has no multiple zeroes on (a,b).
It is convenient to prove the various assertions of the theorem
separately. Except where otherwise stated, the conditions of the theorem on
f will be assumed to be true throughout. The first proposition facilitates
the proof of (2).

Proposition 1. Given any $c_1$ with

$$0 < c_1 < \tfrac{4}{3} \pi^{1/2} c_0,$$

suppose the sequence of window widths $h_n$ satisfies

$$n^{-1} \alpha(h_n) \to c_1. \qquad (5)$$

Then the number of maxima of $f_n$ tends in probability to j.

It follows from Proposition 1 and the theorem of Section 2 that, for all $k \ge j$,
provided (5) holds,

$$P\{h_{crit}(k) \le h_n\} \to 1,$$

and hence that (2) is satisfied.

The proof of Proposition 1 makes use of several lemmas, the first of
which shows that, under certain conditions, maxima and minima of $f_n$ can,
eventually, only occur arbitrarily close to those of f.

Lemma 1. Let I be any closed interval contained in [a,b], such that I
contains none of the zeroes of f'. Then, provided $h_n \to 0$ and
$n^{-1} h_n^2 \alpha(h_n) \to 0$, it will follow that

$$P(f_n \text{ monotonic on } I \text{ in the same sense as } f) \to 1.$$

Proof. By slight adaptation of the results of Silverman (1978a), it can be
seen that, provided f is bounded, we will have, if $h_n$ satisfies the
assumptions of Proposition 1,

$$\sup_{(-\infty, \infty)} |f_n' - E f_n'| = O_p\{n^{-1/2} h_n \alpha(h_n)^{1/2}\} = o_p(1). \qquad (6)$$
In Silverman (1978a) the uniform continuity of f was additionally assumed,
but careful examination of the proofs of that paper shows that the derivation
of the rate of stochastic convergence, though not of the exact constant
implied in the $O_p$, goes through under the assumption of bounded f.

Supposing without loss of generality that f is increasing on I, it
follows from the continuity of f' on [a,b] that f' is bounded away from
zero on I and is non-negative on a neighborhood of I, and hence by
elementary analysis that

$$\liminf\, \inf_I E f_n' > 0. \qquad (7)$$

Combining (6) and (7) completes the proof of Lemma 1.

The next lemma shows that, under suitable conditions, $f_n$ will
eventually have exactly one maximum and no minima near each maximum of f,
and exactly one minimum and no maxima near each minimum of f.

Lemma 2. Suppose f'(z) = 0 and f has a local maximum (respectively
minimum) at z. Suppose $h_n \to 0$ and

    $n^{-1} \alpha(h_n) \to c_2 \in (0, \tfrac{4}{3} \pi^{1/2} f''(z)^2 / f(z))$ .    (8)

Then, for all sufficiently small $\epsilon > 0$, the probability that $f_n'$ has
exactly one zero in $(z-\epsilon, z+\epsilon)$, and that this zero is a maximum
(respectively minimum) of $f_n$, tends to one as n tends to infinity.

Proof. Only the case of a local maximum will be considered. The proof for a
minimum proceeds very similarly and is omitted. Throughout this proof
unqualified infima and suprema will be taken to be over x in $[z-\epsilon, z+\epsilon]$.

By the continuity of f and f'', choose $\epsilon$ sufficiently small that

    $\inf f''(x)^2 / \sup f(x) > 3 c_2 / (4 \pi^{1/2})$    (9)

and also $[z-\epsilon, z+\epsilon] \subset (a,b)$. It is then immediate that
$f'(z-\epsilon) > 0$ and $f'(z+\epsilon) < 0$ since, by (9), f'' cannot cross zero in
$(z-\epsilon, z+\epsilon)$. Since f' is continuous at $z \pm \epsilon$, by standard results
on the consistency of $f_n'$ (a combination of Parzen (1962) and
Bhattacharya (1967))

    $P\{f_n'(z-\epsilon) > 0$ and $f_n'(z+\epsilon) < 0\} \to 1$ .    (10)

Very slightly adapting the proofs of Silverman (1976 and 1978a) to cope
with the fact that f'' is only uniformly continuous on a neighborhood of
$(z-\epsilon, z+\epsilon)$ gives

    $n^{1/2} \alpha(h_n)^{-1/2} \sup |f_n''(x) - E f_n''(x)| \to_p K$

where

    $K^2 = 2 \sup f \int \phi''^2 = 3 (4 \pi^{1/2})^{-1} \sup f$ .

Since, by elementary analysis, $\sup |E f_n''(x) - f''(x)|$ converges to zero, it
follows from (8) that

    $p \limsup_n \sup |f_n''(x) - f''(x)| \le K c_2^{1/2} < \inf |f''(x)|$

by (9). It is immediate that

    $P\{f_n''(x) < 0$ for all x in $[z-\epsilon, z+\epsilon]\} \to 1$ .    (11)

Combining (10) and (11) completes the proof of Lemma 2.

To complete the proof of Proposition 1, note first that no maxima of $f_n$
can occur outside the interval (a,b). Let $z_1, \ldots, z_{2j-1}$ be the zeroes of
f' in (a,b) and choose $\epsilon$ sufficiently small to satisfy the conclusion of
Lemma 2 for all $z_i$ and to ensure that

    $a < z_1 - \epsilon < z_1 + \epsilon < z_2 - \epsilon < \cdots < z_{2j-1} + \epsilon < b$ .    (12)

Applying either Lemma 1 or Lemma 2 as appropriate to each of the intervals in
the partition (12) of the interval (a,b) completes the proof of Proposition
1.

The next proposition leads to the proof of assertion (3), in a similar
way to the derivation of (2) from Proposition 1.

Proposition 2. Defining $\alpha$ as in (1) above, suppose that

    $n^{-1} \alpha(h_n) \to \infty$ and $n^{-1} h_n^{-5} \to 0$ .    (13)

Then the number of maxima in $f_n$ tends in probability to infinity.

Given any k, it follows from this result and the corollary of Silverman
(1981) that, provided (13) holds,

    $P\{h_{crit}(k) > h_n\} \to 1$ ;

assertion (3) follows at once.

To  prove  Proposition  2,  suppose  without  loss  of  generality  that  f has  a 

maximum  at  0 in  (a,b).  Choose  a sequence  l which  satisfies 

n 


1 ♦ 0,  h 'l  ■ o{n  1a(h  )} 

n n n — n 


(14) 


h-1l 
n n 


and  | log  l I I log  h | 
n n 


The explicit dependence of h and l on n will often be suppressed. Let
$I_{j,n}$ be the interval $[(j-1)l, jl]$ for integer j > 0.

Following Silverman (1978a) apply Theorem 3 of Komlós, Major and Tusnády
(1975) to obtain

    $f_n'(x) = E f_n'(x) + h^{-1} n^{-1/2} \rho_1(x) + \epsilon_n'(x)$

where $\rho_1$ is a Gaussian process with the same covariance structure as
$n^{1/2} h (f_n' - E f_n')$ and $\epsilon_n'$ is a secondary random error. The process
$\rho_1$ is obtained by putting $\delta(u)$ equal to $\phi'(u)$ in Proposition 1 of
Silverman (1978a). By elementary analysis and the arguments of Silverman
(1978a) we have, in a neighborhood of 0,

    $|E f_n'(x) - f'(x)| = O(h)$ ;

    $|\epsilon_n'(x)| = O(n^{-1} h^{-2} \log n)$ a.s.
                $= o(h)$ from (13) above ;

and $|f'(x)| = O(x)$ ,

since f'(0) = 0 and f'' exists. It follows that, a.s.,

    $\sup |E f_n'(x) + \epsilon_n'(x)| = O(l) + O(h) = o[\{n^{-1} h^{-3} \log(l/h)\}^{1/2}]$    (15)

by (13) and (14) above, where we adopt the convention, here and subsequently
in this proof, that unqualified suprema are taken to be over the interval
$I_{j,n}$ and that a fixed j is being considered.

We slightly adapt the argument of Silverman (1976) pp. 138-140 to
investigate $\sup \rho_1$. Define

    $\sigma^2(x) = \mathrm{var}\, \rho_1(x) = h^{-1} f(x) \int \phi'^2 \,(1 + o(1))$

               $= h^{-1} f(0) \int \phi'^2 \,(1 + o(1))$ for x in $I_{j,n}$

since the end points of $I_{j,n}$ both converge to zero. Analogously to (12) of
Silverman (1976), given any $\lambda$ in (0,2),


--  t t 


iji  r- 


i i 


pfsup  0 1P1  <•  (1  - J X)  {2  log  (h  >}21 

< OU^JlogJh'V) 


(16) 


x If  lxlexPt2  log(h  V)(1  A) 2 lx l/( 1 ♦ lxl)> 

1j.n 


1 

Id"! 


where  X(*«y>  " corr{p(x) ,p(y)} . Using  a similar  argument  to  that  following 
(12)  of  Silverman  (1976),  but  allowing  the  interval  I to  vary,  shows  that 
the  expression  m (16)  is  dominated  by 

(1  1 ^ y 2 

on'2)  log(h-1i)  (a2(0)  ♦ o(l)}"1  { h” 1 1 } 2 0(1) 

-(hi)  log(h  1)  - 0 


by  ( 1 a ) above. 


It follows that, setting $K = \{2 f(0) \int \phi'^2\}^{1/2}$,

    $p \liminf \sup \{h^{-1} \log(h^{-1} l)\}^{-1/2} \rho_1 \ge K$    (17)

and that the same result holds if $\rho_1$ is replaced by $-\rho_1$, giving a
corresponding result for $\inf \rho_1$. It follows from (15), (17) and the
corresponding result for $\inf \rho_1$ that

    $P\{\rho_1$ crosses $-n^{1/2} h (E f_n' + \epsilon_n')$ in $I_{j,n}\} \to 1$ ,

and hence that

    $P\{f_n'$ crosses zero in $I_{j,n}\} \to 1$ .    (18)

Since (18) holds for all j, the number of maxima in $f_n$ tends in
probability to infinity, completing the proof of Proposition 2.


The final proposition of this section deals with the case where the
alternative hypothesis is true, and shows that $h_{crit}$ will remain bounded
away from zero.

Proposition 3. If k < j then there exists a constant $h_0 > 0$, depending
on f and k, such that

    $P\{h_{crit}(k) > h_0\} \to 1$ .

Proof. By arguments analogous to those of the proof of the theorem of
Silverman (1981), making use of the variation diminishing properties of the
Gaussian kernel and the continuity properties of $E f_n$, the number of maxima
in $E f_n(\cdot,h)$ is a right continuous decreasing function of h, for h > 0. By
choosing $h_0$ sufficiently small, we can ensure that $E f_n(\cdot,h_0)$ has,
independently of n, exactly j maxima. Because of the conditions imposed
on f in the statement of the Theorem above, we can also ensure that
$E f_n''(\cdot,h_0)$ is non-zero at all stationary points of $E f_n(\cdot,h_0)$.

The argument of Lemma 2.2 of Schuster (1969), which does not in fact
require the convergence to zero of the sequence of window widths, then implies
that, with probability one,

    $f_n'(x,h_0) - E f_n'(x,h_0)$ and $f_n''(x,h_0) - E f_n''(x,h_0)$

both converge to zero uniformly over x. By an argument similar to that used
in Proposition 1 above, it follows that the number of maxima of $f_n(\cdot,h_0)$ on
(a,b) tends almost surely to j, the number of maxima of $E f_n(\cdot,h_0)$.
Applying the corollary of Silverman (1981) completes the proof of Proposition
3.
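The quantities that Propositions 1 to 3 describe can be explored numerically. The following is a minimal sketch, not the author's procedure: it counts local maxima of a Gaussian kernel estimate on a grid and locates a critical window width by bisection, relying on the fact used in the proof above that the number of maxima is a decreasing function of h. The function names, the grid, the bisection bounds and the iteration count are all illustrative assumptions.

```python
import numpy as np

def kde_gauss(x_grid, data, h):
    """Gaussian kernel density estimate f_n(x, h) evaluated on x_grid."""
    u = (x_grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))

def count_maxima(x_grid, data, h):
    """Count strict local maxima of the estimate on the grid (a grid approximation)."""
    f = kde_gauss(x_grid, data, h)
    return int(np.sum((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])))

def h_crit(data, k, x_grid, lo=1e-3, hi=None, iters=40):
    """Bisection for the smallest h at which the estimate has at most k maxima.

    Assumes the number of maxima is decreasing in h (Gaussian kernel).
    """
    if hi is None:
        # Assumption: twice the sample range is wide enough to give one mode.
        hi = 2.0 * (data.max() - data.min())
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if count_maxima(x_grid, data, mid) > k:
            lo = mid      # still too many maxima: h_crit lies above mid
        else:
            hi = mid      # at most k maxima: h_crit lies at or below mid
    return hi
```

For a sample drawn from a clearly bimodal density, `h_crit(data, 1, grid)` would be expected to stay well away from zero as n grows, in the spirit of Proposition 3, while `h_crit(data, 2, grid)` shrinks with n.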


Discussion


It is natural to enquire to what extent the conditions of the theorem
above can be relaxed without affecting the conclusions. In particular it
seems intuitively clear that the condition of bounded support for the
density f should be able to be replaced by some condition on the tails of
f, though the present method of proof cannot deal with this case. Condition
(iv) appears to be more fundamental to the result; if, for example,
$f'(0) = f''(0) = 0 \ne f'''(0)$, then an examination of $f_n$ and $E f_n$ near zero
seems to indicate that, under suitable regularity conditions, there will be
no maximum of $f_n$ near zero provided $|f_n''' - E f_n'''|$ remains small. A
heuristic argument suggests that a result corresponding to the theorem of
Section 2 can be proved, but with $\alpha(h)$ replaced by $h^{-7} \log(h^{-1})$, so that
$h_{crit}$ converges to zero more slowly. Even slower convergence will occur for
higher order zeroes in f'.

The interest in this discussion lies in the fact that the bootstrap
density constructed using the critical window width will not only have
infinite tails of similar weight to those of the corresponding normal kernels
but will also have a stationary point which is a point of inflexion. The
slower convergence to zero of $h_{crit}$ provides support for the remark in
Section 3 that the bootstrap test may be conservative; it also bears
out the intuition of P. Huber (personal communication) that the bootstrap
procedure may be excessively conservative, though the difference between
$n^{-1/5}$ and $n^{-1/7}$ convergence is very slight in practice.

The methods of this paper can also be used to study the asymptotic
properties of a corresponding test for the number of points of inflexion in
the density. Both Cox (1966) and Good and Gaskins (1980) prefer to use points
of inflexion as an indication that the density is a mixture. The critical
window width will now be the smallest window width for which the density has
k maxima. Under suitable conditions a result corresponding to the theorem of
Section 2 can be proved, but again, among other changes, $\alpha(h)$ will be
replaced by $h^{-7} \log(1/h)$ since $f_n''$ will be replaced by $f_n'''$ in much of the
argument of the proofs of Propositions 1 and 2.
A DATA BASED RANDOM NUMBER GENERATOR FOR A MULTIVARIATE DISTRIBUTION*

(USING STOCHASTIC INTERPOLATION)

James R. Thompson, Rice University

Malcolm S. Taylor, USA Ballistic Research Laboratory

ABSTRACT. Let X be a k-dimensional random variable serving as input for a
system with output Y (not necessarily of dimension k). Given X, an outcome
Y or a distribution of outcomes G(Y|X) may be obtained either explicitly or
implicitly. We consider here the situation in which we have a real world
data set $\{X_j\}$ and a means of simulating an outcome Y. A method for
empirical random number generation based on the sample of observations of
the random variable X without estimating the underlying density is discussed.


INTRODUCTION. The manner of dealing with multivariate data depends upon the
application at hand. For example, let us suppose that $\{X_j\}_{j=1}^{n}$ is a sample
of size n of a k-dimensional random variable. We may be interested simply
in estimating the mean $\mu$. In such a case, we may complete our task by
computing the sample mean $\bar{X}$. If we are interested in the interrelationships
between the various vector components, we may find it desirable to compute
the sample covariance matrix S.


At a greater level of complexity, we may be required to estimate the
density of X nonparametrically [1,3]. Here, the representational difficulties
are substantial, particularly for k > 2, where our 3-dimensional intuitions
are inadequate for graphing the density even if we knew it precisely on a
discrete mesh. Indeed, it would appear that for increasing dimensionality,
our estimation theoretic difficulties pale in comparison to those of
representation.

*This research was supported in part by ARO Contract DAAG-29-82-K-0014 at
Rice University. To appear in Proceedings of the Twenty-Seventh Conference
on the Design of Experiments in Army Research Development and Testing.
Suppose we are given, for example, the task of estimating the density
f at a point $X_0$ in k-space, based on a sample of size n. The naive nearest
neighbor estimator

    $\hat{f}_n(X_0) = \frac{p}{n} \cdot \frac{1}{V_k(X_0, d(X_0,p))}$ ,

where $d(X_0,p)$ is the Euclidean distance from $X_0$ to the pth nearest neighbor
and $V_k(X_0, d(X_0,p))$ is the volume of the k-sphere centered at $X_0$ with radius
$d(X_0,p)$, is likely to be quite satisfactory. But a problem occurs when we
are asked for a usable summary of the unknown density over the space of
nonnegligible mass. If we know the functional form of the density $f(X;\theta)$,
then we have a relatively easy task: the estimation of $\theta$. But in the
highly ubiquitous nonparametric situation, in which we do not know the
functional form of f, we are not so fortunate. We might decide, for example, to
tabulate f on a mesh of size 20 in each dimension. This would require $20^k$
pointwise estimations of f, a tedious but manageable task. But how shall
we scan this k-dimensional table to obtain a useful feel for the density?
Other approaches, clearly, are required. One of these is discussed in [2].
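The pointwise nearest neighbor estimate is simple to compute, which is what makes the $20^k$ tabulation tedious rather than difficult. A minimal sketch follows, assuming the estimator has the form p / (n V_k) with V_k the volume of the Euclidean ball whose radius is the distance to the pth nearest neighbor; the function names are illustrative and not taken from the paper.

```python
import math
import numpy as np

def ball_volume(k, r):
    """Volume of the k-dimensional Euclidean ball of radius r."""
    return math.pi ** (k / 2.0) / math.gamma(k / 2.0 + 1.0) * r ** k

def knn_density(x0, data, p):
    """Nearest neighbor density estimate p / (n * V_k) at the point x0.

    data is an (n, k) array; p is the number of neighbors used.
    """
    n, k = data.shape
    dist = np.sqrt(((data - x0) ** 2).sum(axis=1))
    d_p = np.partition(dist, p - 1)[p - 1]   # distance to the p-th nearest neighbor
    return p / (n * ball_volume(k, d_p))
```

Evaluating `knn_density` on a mesh of 20 points per axis costs 20**k calls, for example 160,000 calls when k = 4, which illustrates why a full tabulation quickly stops being a useful summary.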

There are, happily, cases in which the density representational
difficulties may be sidestepped when coping nonparametrically with data sets in
higher dimensions. For example, let us suppose the k-dimensional random
variable X is an input into a system with output Y (of whatever dimension).
Given X, an outcome Y or a distribution of outcomes G(Y|X) is obtained
either explicitly or implicitly through an output data set. Let us suppose
these outcomes fall into six categories: Very Good, Good, Fair, Poor, Very
Bad, Catastrophically Bad. Suppose further that these sets are well-defined
in the Y-space. We are given a real world data set $\{X_j\}_{j=1}^{n}$. We have a
means of simulating an outcome Y given the input X. We wish to determine
the probability of arriving in each of the six category sets.

One way to achieve this result might be, simply, to sample from the n
data points $\{X_j\}_{j=1}^{n}$. In many cases this will prove quite satisfactory.

But let us suppose that "Catastrophically Bad" happens for Y > 10,
where $Y = 1 / \sum_{i=1}^{4} x_i^2$ with $X = (x_1, x_2, x_3, x_4)$.

Then, if the $x_i$'s are (unbeknownst to us, but in actuality) independently
distributed as N(0,1), the chance of a "Catastrophically Bad" event is
.0012. Let us suppose the size (n) of our data set is 100. The chance of
none of these observations being in the "Catastrophically Bad" region is
.887. So, a simulation which used only the 100 data points would, with
probability .887, give us the information that "Catastrophically Bad"
occurred with zero probability. We need to avoid this pitfall.
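The two probabilities quoted above can be checked in closed form. The sketch below uses the standard fact that a sum of four squared independent N(0,1) variables is chi-square with 4 degrees of freedom, with distribution function 1 - exp(-s/2)(1 + s/2); the function name is an illustrative choice.

```python
import math

def prob_catastrophic(threshold=10.0):
    """P(Y > threshold) for Y = 1 / (x1^2 + ... + x4^2), x_i independent N(0, 1).

    Y > threshold is the event that the chi-square(4) sum of squares
    falls below s = 1 / threshold.
    """
    s = 1.0 / threshold
    return 1.0 - math.exp(-s / 2.0) * (1.0 + s / 2.0)

p = prob_catastrophic()
p_none_in_100 = (1.0 - p) ** 100   # no catastrophic point among 100 observations
print(round(p, 4), round(p_none_in_100, 3))
```

The first value rounds to .0012. The second comes out near .886 with the unrounded probability, and near .887 when the rounded figure .0012 is used, so the text's figures are consistent up to rounding.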

One  procedure  would  be  to  estimate  the  density  of  X nonparametrically 
and  then  build  a random  number  generator  using  the  density.  Such  a scheme 
would  run  into  the  representational  difficulties  mentioned  above.  We  can 
be  much  more  efficient. 

THE ALGORITHM. Let us consider the following situation: We have a random
sample $\{X_j\}_{j=1}^{n}$ of size n from a multivariate distribution of dimension k,
and we want to generate pseudorandom vectors from the underlying, but unknown,
distribution that gave rise to the random sample. Since we do not know,
and usually will never know, the form of this distribution, our attack
should be empirical. We shall endeavor to see to it that our pseudorandom
vectors look very much like those in the original data set. In so doing, we
will maintain the essential structural integrity of the problem.


We now direct our attention to the mechanics of the algorithm. After
carrying out a rough rescaling to account for differing variances that may
exist among the k variates, we select at random one of the n data points,
say $X_1$, from the data base and then proceed to determine its m-1 nearest
neighbors. The nearest neighbors are determined under the ordinary Euclidean
metric. The value of m will depend upon the sample size n and the
characteristics of the data, and can best be determined after perusal of the data.
A conservative estimate would be to choose m = n/20.


The vectors $\{X_l\}_{l=1}^{m}$ are now coded about the sample mean
$\bar{X} = (1/m) \sum_{l=1}^{m} X_l$ to yield $\{X_l'\} = \{X_l - \bar{X}\}_{l=1}^{m}$, and an
independent random sample of size m is generated from the uniform distribution

    $U\left(1/m - \sqrt{3(m-1)/m^2},\ 1/m + \sqrt{3(m-1)/m^2}\right)$ .

Now the linear combination

    $X' = \sum_{l=1}^{m} u_l X_l'$

is formed, where $\{u_l\}_{l=1}^{m}$ is the random sample from the uniform
distribution above. Finally the translation

    $X = X' + \bar{X}$

restores the relative magnitude, and X is a pseudorandom vector which we
propose to be representative of the multivariate distribution that provided
the random sample.

To obtain the next pseudorandom vector we randomly select another of
the n data points and proceed as above.
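The steps above can be sketched in a few lines. This is an illustrative reading rather than the authors' code: the function name is hypothetical, the "rough rescaling" is assumed here to be division by each variate's sample standard deviation, and the selected point is counted as one of the m members of its own cloud. It requires m <= n.

```python
import numpy as np

def stochastic_interpolation(data, m, size, rng=None):
    """Generate `size` pseudorandom vectors from an (n, k) data array.

    Sketch of the steps described above; rescaling by the per-variate
    standard deviation is an assumption, not a detail from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    scale = data.std(axis=0, ddof=1)
    scale[scale == 0.0] = 1.0                  # guard against constant variates
    z = data / scale                           # rescaled copy used for all steps
    half_width = np.sqrt(3.0 * (m - 1)) / m    # gives Var(u) = (m - 1) / m**2
    out = np.empty((size, k))
    for t in range(size):
        i = rng.integers(n)                    # pick one data point at random
        d2 = ((z - z[i]) ** 2).sum(axis=1)
        idx = np.argpartition(d2, m - 1)[:m]   # the point and its m - 1 nearest neighbors
        cloud = z[idx]
        center = cloud.mean(axis=0)            # local sample mean of the cloud
        u = rng.uniform(1.0 / m - half_width, 1.0 / m + half_width, size=m)
        out[t] = center + u @ (cloud - center) # X = X' + Xbar
    return out * scale                         # undo the rescaling
```

A call such as `stochastic_interpolation(data, m=max(2, len(data) // 20), size=1000)` follows the conservative choice m = n/20 suggested in the text. Note that no density estimate is ever formed; each draw only touches one local cloud of m points.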

We will now attempt to motivate the algorithm by considering the
mathematics that suggests the mechanics that we have just outlined. Consider
the distribution of $X_1$ and its m-1 nearest neighbors:

    $\{(x_{1l}, x_{2l}, \ldots, x_{kl})'\}_{l=1}^{m} = \{X_l\}_{l=1}^{m}$ .

Let us suppose that this "truncated set" of random observations has mean
vector $\mu$ and covariance matrix $\Sigma$. Let $\{u_l\}_{l=1}^{m}$ be an independent random
sample from the uniform distribution
$U(1/m - \sqrt{3(m-1)/m^2},\ 1/m + \sqrt{3(m-1)/m^2})$. Then $E(u_l) = 1/m$,
$\mathrm{Var}(u_l) = (m-1)/m^2$, and $\mathrm{Cov}(u_i, u_j) = 0$ for $i \ne j$.

Forming the linear combination

    $Z = \sum_{l=1}^{m} u_l X_l$

we have, for the rth component $z_r = u_1 x_{r1} + u_2 x_{r2} + \ldots + u_m x_{rm}$, the
following relations:

    $E(z_r) = m \cdot (1/m) \cdot \mu_r = \mu_r$ ,

    $\mathrm{Var}(z_r) = \sigma_r^2 + ((m-1)/m)\, \mu_r^2$ ,

    $\mathrm{Cov}(z_r, z_s) = \sigma_{rs} + ((m-1)/m)\, \mu_r \mu_s$ .

Clearly, if the mean vector of X was $\mu = (0, 0, \ldots, 0)'$, then the mean vector
and covariance matrix of Z would be identical to those of X. In the less
idealized situation with which we are confronted, the translation to the
sample mean of the nearest neighbor cloud should result in the
pseudoobservation having very nearly the same mean and covariance structure as that
of the (truncated) distribution of the points in the nearest neighbor cloud, a
conjecture borne out in many actual cases that have been considered. For m
moderately large, our algorithm essentially samples from n Gaussian
distributions with the means and covariance matrices corresponding to those of
the n m-nearest-neighbor clouds.
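The variance relation can be checked directly. The derivation below is a sketch under the assumptions, implicit in the text, that the $X_l$ are independent with common mean $\mu$ and covariance $\Sigma$, and that the $u_l$ are independent of the $X_l$ and of one another.

```latex
\begin{align*}
\operatorname{Var}(u_l x_{rl})
  &= E(u_l^2)\,E(x_{rl}^2) - \{E(u_l)\,E(x_{rl})\}^2 \\
  &= \left(\frac{m-1}{m^2} + \frac{1}{m^2}\right)(\sigma_r^2 + \mu_r^2)
     - \frac{\mu_r^2}{m^2}
   = \frac{\sigma_r^2}{m} + \frac{m-1}{m^2}\,\mu_r^2 , \\
\operatorname{Var}(z_r)
  &= \sum_{l=1}^{m} \operatorname{Var}(u_l x_{rl})
   = \sigma_r^2 + \frac{m-1}{m}\,\mu_r^2 .
\end{align*}
```

The same calculation with $E(x_{rl} x_{sl}) = \sigma_{rs} + \mu_r \mu_s$ in place of $E(x_{rl}^2)$ gives the covariance relation. The half-width $\sqrt{3(m-1)/m^2}$ of the uniform distribution is exactly what makes $\operatorname{Var}(u_l) = (m-1)/m^2$, since a uniform variable on an interval of half-width w has variance $w^2/3$.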


EXAMPLES. For a substantial test case, we considered a mixture of three
bivariate normal distributions. The first ($N_1$) has mean vector (…) and
covariance matrix $\begin{pmatrix} 1 & -1/2 \\ -1/2 & 1 \end{pmatrix}$; the second ($N_2$) has mean
vector (-2, …)' and covariance matrix $\begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}$; and the
third ($N_3$) has mean vector (2, …)' and covariance matrix
$\begin{pmatrix} \ldots & \ldots \\ 1/10 & 1 \end{pmatrix}$. The corresponding mixing scalars are
$\alpha_1 = 1/2$, $\alpha_2 = 1/3$, and $\alpha_3 = 1/6$, respectively. Representative contours
of equal density are illustrated in Figure 1. To establish a data base, a
sample of eighty-five points was generated from this distribution via Monte
Carlo simulation; a sample of eighty-five pseudorandom values was then
produced by the algorithm, and the combined sample is shown in Figure 2.


Notice  that  the  structure  of  the  data  is  maintained  in  that  the  modes 
are  preserved;  the  algorithm  has  not  attempted  to  fill  in  gaps  where  gaps 
belong;  the  algorithm  has,  however,  generated  some  points  outside  the  boundary 
of  the  convex  hull  of  the  data  base,  all  of  which  are  desirable  properties. 
These  observations  lend  credence  to  the  term  "structural  integrity"  mentioned 
previously. 

An application of the algorithm to a real world data set is summarized
in Figures 3 and 4. In Figure 3, a two-dimensional marginal of a set of 973
four-dimensional behind armor debris measurements is portrayed; in Figure 4,
973 simulated data points produced by our procedure are shown. Once again,
the salient features of the data set are preserved.
Fig. 1. Density contours for a mixture of three bivariate normal distributions.
Fig. 2. Combined sample: Data base and Pseudoobservations
Fig. 3. Marginal data for 4-dimensional behind armor debris.


CONCLUSIONS. We have demonstrated a means of empirical random number
generation based on a sample of observations of a random variable X. No
estimation of the underlying density is required. And, because of the local
nature of the generation scheme, it is essentially free of assumptions on
the underlying density of X. Naturally, any attempt to use this algorithm
for generating bona fide new observations using the computer rather than
producing real world data would be unwise. Rather, the algorithm operates
somewhat like a smooth interpolator, highly dependent on the quality of the
data points on which it is based. It gives us a means of avoiding nonrobust
conclusions due to "holes" in the data set at important points of the
simulation model.

Also included in Thompson's presentation was a discussion of how
alternatives to the usual (contour map) density estimators may be constructed
based on stochastic interpolation.



REFERENCES

1. Bean, Steven J. and Tsokos, Chris P. (1980). "Developments in
nonparametric density estimation," International Statistical Review, v. 48,
pp. 267-287.

2. Fwu, Chih-chy, Tapia, Richard A. and Thompson, James R. (1980). "The
nonparametric estimation of probability densities in ballistics research,"
to appear in the Proceedings of the Twenty-Sixth Conference on the Design
of Experiments in Army Research, Development and Testing.

3. Scott, David W., Tapia, Richard A. and Thompson, James R. (1978).
"Multivariate density estimation by discrete maximum penalized likelihood
methods," Graphical Representations of Multivariate Data, Peter C.
Wang, ed., Academic Press, pp. 169-182.




MIXTURE  DENSITIES,  MAXIMUM  LIKELIHOOD,  AND  THE  EM  ALGORITHM 

by 

Richard  A.  Redner 

Department  of  Mathematical  Sciences 
University of Tulsa

Abstract: The problem of estimating the parameters which
determine a mixture density has been the subject of a large,
diverse body of literature spanning nearly ninety years.
During the last two decades, the method of maximum-likelihood
has become the most widely followed approach to this problem,
thanks primarily to the advent of high-speed electronic
computers. Here, we first offer a brief survey of the literature
directed toward this problem and review maximum-likelihood
estimation for it. We then turn to the subject of ultimate
interest, which is a particular iterative procedure for
numerically approximating maximum-likelihood estimates for mixture
density problems. This procedure, known as the EM algorithm,
is a specialization to the mixture density context of a general
algorithm of the same name used to approximate maximum-likelihood
estimates for incomplete data problems. We discuss the
formulation and theoretical and practical properties of the EM
algorithm for mixture densities, focussing in particular on
mixtures of densities from exponential families.

Key words and phrases: Mixture densities, maximum-likelihood,
EM algorithm, exponential families, incomplete data.
1. Introduction

Of interest here is a parametric family of finite mixture
densities, i.e., a family of probability density functions of the
form

    $p(x \mid \Phi) = \sum_{i=1}^{m} \alpha_i \, p_i(x \mid \phi_i)$ ,  $x = (x_1, \ldots, x_n)^T \in R^n$ ,    (1.1)

where each $\alpha_i$ is nonnegative and $\sum_{i=1}^{m} \alpha_i = 1$, and where each
$p_i$ is itself a density function parametrized by $\phi_i \in \Omega_i \subseteq R^{n_i}$.
We denote $\Phi = (\alpha_1, \ldots, \alpha_m, \phi_1, \ldots, \phi_m)$ and set

    $\Omega = \{(\alpha_1, \ldots, \alpha_m, \phi_1, \ldots, \phi_m) : \sum_{i=1}^{m} \alpha_i = 1$ and $\alpha_i \ge 0$, $\phi_i \in \Omega_i$
    for $i = 1, \ldots, m\}$ .

The more general case of a possibly infinite mixture density,
expressible as

    $\int_{\Lambda} p(x \mid \phi(\lambda)) \, d\alpha(\lambda)$ ,    (1.2)

is not considered here, even though much of the following is
applicable with few modifications to such a density. For general
references dealing with infinite mixture densities and related
densities not considered here, see the survey of Blischke [12].
Also, it is understood that in determining probabilities,
probability density functions are integrated with respect to a
measure on $R^n$ which is either Lebesgue measure, counting
measure on some finite or countably infinite subset of $R^n$, or
a combination of the two. In the following, it is usually
obvious from the context which measure on $R^n$ is appropriate
for a particular probability density function, and so measures on
$R^n$ are not specified unless there is a possibility of
confusion. It is further understood that the topology on $\Omega$ is
the natural product topology induced by the topology on the real
numbers. At times when it is convenient to determine this
topology by a norm, we will regard elements of $\Omega$ as
$(m + \sum_{i=1}^{m} n_i)$-vectors and consider norms defined on such
vectors.
Finite mixture densities arise naturally, and can naturally
be interpreted, as densities associated with a statistical
population which is a mixture of m component populations with
associated component densities $p_1, \ldots, p_m$ and mixing
proportions $\alpha_1, \ldots, \alpha_m$. Such densities appear as fundamental
models in areas of applied statistics such as statistical pattern
recognition, classification, and clustering. (As examples of
general references in the broad literature on these subjects, we
mention Duda and Hart [44], Fukunaga [48], Hartigan [62], Van
Ryzin [128], and Young and Calvert [138]. For some specific
applications, see, for example, the Special Issue on Remote
Sensing of the Communications in Statistics [32].) In addition,
finite mixture densities often are of interest in life testing
and acceptance testing (cf. Cox [34], Hald [60], Mendenhall and
Hader [89], and other authors referred to by Blischke [12]).
Finally, many scientific investigations involving statistical
modeling require by their very nature the consideration of
mixture populations and their associated mixture densities. The
example of Hosmer [68] below is simple but typical. For
references to other examples in fishery studies, genetics,
medicine, chemistry, psychology, and other fields, see Blischke
[12], Everitt and Hand [45], and Hosmer [67].

Example: According to the International Halibut Commission of
Seattle, Washington, the length distribution of halibut of a
given age is closely approximated by a mixture of two normal
distributions corresponding to the length distributions of the
male and female subpopulations. Thus the length distribution is
modeled by a mixture density of the form

    $p(x \mid \Phi) = \alpha_1 p_1(x \mid \phi_1) + \alpha_2 p_2(x \mid \phi_2)$ ,  $x \in R^1$ ,    (1.3)

where for i = 1,2 ,

    $p_i(x \mid \phi_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \, e^{-(x-\mu_i)^2 / 2\sigma_i^2}$ ,  $\phi_i = (\mu_i, \sigma_i^2)^T \in R^2$ ,    (1.4)

and $\Phi = (\alpha_1, \alpha_2, \phi_1, \phi_2)$. Suppose that one would like to estimate
$\Phi$ on the basis of some sample of length measurements of halibut
of a given age. If one had a large sample of measurements which
were labeled according to sex, then it would be an easy and
straightforward matter to obtain a satisfactory estimate of $\Phi$.
Unfortunately, it is reported in [68] that the sex of halibut
cannot be easily (i.e., cheaply) determined by humans; therefore,
as a practical matter, it is likely to be necessary to estimate
$\Phi$ from a sample in which the majority of members are not labeled
according to sex.
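To make the example concrete, the following minimal sketch evaluates a two-component normal mixture of the form (1.3)-(1.4) and draws an unlabeled sample from it. The function names are illustrative, and any numerical parameter values passed to them (mixing proportions, means, variances) would be assumptions for demonstration, not figures from the halibut data.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var, as in (1.4)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, alphas, params):
    """p(x | Phi) = sum_i alpha_i * p_i(x | phi_i), with phi_i = (mu_i, var_i)."""
    return sum(a * normal_pdf(x, mu, var) for a, (mu, var) in zip(alphas, params))

def sample_unlabeled(n, alphas, params, rng):
    """Draw n observations, discarding the component label of each."""
    out = []
    for _ in range(n):
        mu, var = rng.choices(params, weights=alphas)[0]  # hidden component
        out.append(rng.gauss(mu, math.sqrt(var)))
    return out
```

For instance, `sample_unlabeled(500, [0.4, 0.6], [(100.0, 25.0), (110.0, 36.0)], random.Random(0))` produces lengths whose component of origin is no longer recorded, which is exactly the situation in which $\Phi$ must be estimated from unlabeled data.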

Regarding p in (1.1) as modeling a mixture population, we
say that a sample observation on the mixture is labeled if its
component population of origin is known with certainty;
otherwise, we say that it is unlabeled. The example above
illustrates the central problem with which we are concerned here,
namely that of estimating $\Phi$ in (1.1) using a sample in which
some or all of the observations are unlabeled. This problem is
referred to in the following as the mixture density estimation
problem. (For simplicity, we do not consider here the problem of
estimating not only $\Phi$ but also the number m of component
populations in the mixture.) A variety of cases of this problem
and several approaches to its solution have been the subject of
or at least touched on by a large, diverse set of papers spanning
nearly ninety years. We begin by offering in the next section a
cohesive but very sketchy review of those papers of which we are
aware which have as their main thrust some aspect of this problem
and its solution. It is hoped that this survey will provide both
some perspective in which to view the remainder of this paper and
a starting point for those who wish to explore the literature
associated with this problem in greater depth.

Following  the  review  in  the  next  section,  we  discuss  at  some 
length  the  method  of  maximum-likelihood  for  the  mixture  density 
estimation  problem.  In  rough  general  terms,  a maximum-likelihood 
estimate of a parameter which determines a density function is a
choice of the parameter which maximizes the induced density
function (called in this context the likelihood function) of a
given sample of observations. Maximum-likelihood estimation has
been the approach to the mixture density estimation problem most
widely considered in the literature since the use of high speed
electronic computers became widespread in the 1960's. In Section
3,  the  maximum-likelihood  estimates  of  interest  here  are  defined 
precisely,  and  both  their  important  theoretical  properties  and 
aspects of their practical behavior are summarized.

The remainder of the paper is devoted to the subject of
ultimate interest here, which is a particular iterative procedure
for numerically approximating maximum-likelihood estimates of the
parameters in mixture densities. This procedure is a
specialization to the mixture density estimation problem of a
general method for approximating maximum-likelihood estimates in
an incomplete data context which was formalized by Dempster,
Laird, and Rubin [38] and termed by them the EM algorithm (E for
"expectation" and M for "maximization"). The EM algorithm for
the mixture density estimation problem has been studied by many
authors over the last two decades. In fact, there have been a
number of independent derivations of the algorithm from at least
two quite distinct points of view. It has been found in most
instances to have the advantages of reliable global convergence,
low cost per iteration, economy of storage, and ease of
programming as well as a certain heuristic appeal. On the other
hand, it can also exhibit hopelessly slow convergence in some
seemingly innocuous applications. All in all, it is undeniably
of considerable current interest, and it seems likely to play an
important role in the mixture density estimation problem for some
time  to  come. 
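
To give a feel for the low cost per iteration and ease of programming mentioned above, here is a minimal Python sketch of one EM iteration for a mixture of univariate normal densities fitted to an unlabeled sample. The update formulas are the commonly used ones for normal mixtures and are an assumption of this illustration, since the algorithm itself is only formulated later in the paper; the function names, starting values, and simulated data are likewise hypothetical.

```python
import math
import random

def normal_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def log_likelihood(xs, alphas, mus, sigma2s):
    return sum(
        math.log(sum(a * normal_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigma2s)))
        for x in xs
    )

def em_step(xs, alphas, mus, sigma2s):
    """One EM iteration: posterior weights (E), then weighted re-estimation (M)."""
    m = len(alphas)
    # E-step: posterior probability that each observation came from each component.
    weights = []
    for x in xs:
        joint = [alphas[i] * normal_pdf(x, mus[i], sigma2s[i]) for i in range(m)]
        total = sum(joint)
        weights.append([j / total for j in joint])
    # M-step: weighted proportions, means, and variances.
    new_alphas, new_mus, new_sigma2s = [], [], []
    for i in range(m):
        w_sum = sum(w[i] for w in weights)
        mu = sum(w[i] * x for w, x in zip(weights, xs)) / w_sum
        s2 = sum(w[i] * (x - mu) ** 2 for w, x in zip(weights, xs)) / w_sum
        new_alphas.append(w_sum / len(xs))
        new_mus.append(mu)
        new_sigma2s.append(s2)
    return new_alphas, new_mus, new_sigma2s

# Simulated unlabeled sample (hypothetical parameters) and a rough starting guess.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) if rng.random() < 0.3 else rng.gauss(4.0, 1.0) for _ in range(200)]
params = ([0.5, 0.5], [-1.0, 5.0], [1.0, 1.0])
history = [log_likelihood(xs, *params)]
for _ in range(25):
    params = em_step(xs, *params)
    history.append(log_likelihood(xs, *params))
```

Each iteration needs only one pass over the sample and a handful of running sums, and `history` records the log-likelihood after each step so that its progress can be inspected.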

We  feel  that  the  point  of  view  toward  the  EM  algorithm  for 
mixture  densities  advanced  in  [38]  greatly  facilitates  both  the 
formulation  of  a general  procedure  for  prescribing  the  algorithm 
and  the  understanding  of  the  important  theoretical  properties  of 
the algorithm. Our objectives in the following are to present
this point of view in detail in the mixture density context, to
unify and extend the diverse results in the literature concerning
the derivation and theoretical properties of the EM algorithm,
and to review and add to what is known about its practical
behavior.

In  Section  4,  we  interpret  the  mixture  density  estimation 
problem  as  an  incomplete  data  problem,  formulate  the  general  EM 
algorithm  for  mixture  densities  from  this  point  of  view,  and 
discuss  the  general  properties  of  the  algorithm.  In  Section  5, 
the  focus  is  narrowed  to  mixtures  of  densities  from  the 
exponential  family,  and  we  summarize  and  augment  the  results  of 
investigations  of  the  EM  algorithm  for  such  mixtures  which  have 
appeared in the literature. Finally, in Section 6, we discuss
the performance of the algorithm in practice through qualitative
comparisons with other algorithms and numerical studies in simple
but  important  cases. 


2. A Review of the Literature

The  following  is  a skeletal  survey  of  papers  which  are 
primarily  directed  toward  some  part  of  the  mixture  density 
estimation  problem.  No  attempt  has  been  made  to  include  papers 
which  are  strictly  concerned  with  applications  of  estimation 
procedures  and  results  developed  elsewhere.  For  additional 
references relating to mixture densities as well as more detailed
summaries of the contents of many of the papers touched on below,
we refer the reader to the recently published monograph by
Everitt and Hand [45]. As a convenience, this survey has been
divided  somewhat  arbitrarily  by  topics  into  four  subsections. 
Not  surprisingly,  many  papers  are  cited  in  more  than  one 
subsection. 

2.1 The method of moments.

The  first  published  investigation  relating  to  the  mixture 
density  estimation  problem  appears  to  be  that  of  Pearson  [97]. 
In that paper, as in Example 1.1, the problem considered is the
estimation of the parameters in a mixture of two univariate
normal densities. The sample from which the estimates are
obtained is assumed to be independent and to consist entirely of
unlabeled observations on the mixture. (Since this is the sort
of sample dealt with in the vast majority of work on the problem
at hand, it is understood in this review that all samples are of
this type unless otherwise indicated.) The approach suggested by
Pearson for solving the problem is known as the method of
moments. The method of moments consists generally of equating
some set of sample moments to their expected values and thereby
obtaining a system of (generally nonlinear) equations for the
parameters in the mixture density. To estimate the five
independent parameters in a mixture of two univariate normal
densities according to the procedure of [97], one begins with
equations determined by the first five moments and, after
considerable algebraic manipulation, ultimately arrives at
expressions for estimates which depend on a suitably chosen root
of a single ninth degree polynomial.
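
The moment equations themselves are easy to write down, even though solving them is not. The Python sketch below forms the five equations for a mixture of two univariate normal densities by comparing the first five theoretical raw moments with the corresponding sample moments. It does not reproduce Pearson's reduction to a ninth degree polynomial; the function names, the parameter ordering (alpha, mu1, sigma2_1, mu2, sigma2_2), and the data are assumptions made for illustration.

```python
def normal_raw_moments(mu, sigma2, order):
    """Raw moments E[X^k], k = 0..order, of a normal density, using the
    recursion m_k = mu * m_{k-1} + (k - 1) * sigma2 * m_{k-2}."""
    m = [1.0, mu]
    for k in range(2, order + 1):
        m.append(mu * m[k - 1] + (k - 1) * sigma2 * m[k - 2])
    return m[: order + 1]

def mixture_raw_moments(alpha, mu1, sigma2_1, mu2, sigma2_2, order=5):
    """Raw moments of orders 1..order of the two-component normal mixture."""
    m1 = normal_raw_moments(mu1, sigma2_1, order)
    m2 = normal_raw_moments(mu2, sigma2_2, order)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(m1, m2)][1:]

def sample_raw_moments(xs, order=5):
    n = len(xs)
    return [sum(x ** k for x in xs) / n for k in range(1, order + 1)]

def moment_residuals(params, xs):
    """Left-hand sides of the five moment equations; a moment estimate
    is a parameter vector at which all five residuals vanish."""
    theoretical = mixture_raw_moments(*params)
    observed = sample_raw_moments(xs)
    return [t - s for t, s in zip(theoretical, observed)]

# Hypothetical data and trial parameter values.
xs = [-1.8, -0.9, -1.1, 0.2, 1.0, 1.3, 0.7, 2.1]
residuals = moment_residuals((0.5, -1.0, 1.0, 1.0, 1.0), xs)
```

A general-purpose nonlinear equation solver could be applied to `moment_residuals`; the historical difficulty described above lies in doing this by hand.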

From  the  time  of  the  appearance  of  Pearson's  paper  until  the 
use of high speed electronic computers became widespread in the
1960's, only fairly simple mixture density estimation problems
were studied, and the method of moments was usually the method of
choice for their solution. During this period, most energy
devoted to mixture problems was directed toward mixtures of
normal densities, especially toward Pearson's case of two
univariate normal densities. Indeed, most work on normal
mixtures during this period was intended either to simplify the
job of obtaining Pearson's estimates or to offer more accessible
estimates in restricted cases. Charlier [24] described the
implementation of Pearson's method as "an heroic task", and
suggested a somewhat simpler method of solving the moment
equations which involves a cubic and a ratio of two other
polynomials. Pearson and Lee [99] recommended using "incomplete"
normal moment functions to obtain first approximations to the
roots of the nonic equation produced by the procedure of Pearson
[97].  Charlier  and  Wicksell  [25]  further  simplified  the  method 
of  Pearson  [97],  suggested  graphical  methods  for  obtaining  roots 
of  the  nonic,  and  studied  estimates  which  can  be  obtained 
relatively  easily  under  the  assumption  of  known  means,  equal 
variances,  or  symmetry  of  the  mixture  density.  Burrau  [18] 
computed certain "half-invariant" functions of the moments,
thereby obtaining new equations for the five unknown parameters;
convenient methods for the solution of these equations are
offered in the companion paper of Stromgren [119]. Gottschalk
[51]  exploited  symmetry  to  obtain  simple  equations  satisfied  by 
the  moment  estimates  for  a symmetric  mixture  of  two  univariate 
normal  densities.  Graphical  aids  for  obtaining  Pearson’s  moment 
estimates were derived by Sittig [116], Weichselberger [131], and
Preston  [104].  Cohen  [31]  suggested  circumventing  the  solution 
of  Pearson's  nonic  equation  via  an  iteration  which  involves 
solving  a cubic  equation  at  each  step.  An  independent  sample 
from  one  component  of  the  mixture  was  used  by  Dick  and  Bowden 
[42]  to  estimate  one  mean  and  one  variance,  thereby  reducing  to 
three  the  number  of  parameters  to  be  estimated  from  an  unlabeled 
sample  on  the  mixture;  their  estimates  were  used  as  initial 
approximations in an iterative procedure for approximating
maximum-likelihood  estimates.  Gridgeman  [53]  discussed  moment 
estimates  of  the  variances  and  the  mixing  proportion  under  the 
assumption  of  a common  mean.  Robertson  and  Fryer  [113]  and  Fryer 
and  Robertson  [47]  studied  the  statistical  properties  of  the 
moment estimates and compared them to the multinomial
maximum-likelihood and minimum chi-square estimates obtained by
grouping the sample observations. Assuming equal variances, Tan
and Chang [121] compared the efficiency of the moment and
maximum-likelihood estimates by computing the asymptotic variances of the
estimates.  The  space  of  acceptable  solutions  of  the  moment 
equations  was  described  by  Bowman  and  Shenton  [16].  Finally, 
Quandt  and  Ramsey  [106]  compared  moment  estimates  with  the 
estimates  produced  by  their  moment  generating  function  method,  of 
which  we  say  more  later. 

Some  work  has  been  done  extending  Pearson's  method  of 
moments  to  more  general  mixtures  of  normal  densities  and  to 
mixtures  of  other  continuous  densities.  Pollard  [103]  obtained 
moment  estimates  for  a mixture  of  three  univariate  normal 
densities  by  assuming  symmetry  and  other  simplifying  features 
which  reduce  the  number  of  unknown  parameters  to  four.  The 
problem  of  obtaining  moment  estimates  for  mixtures  of 
multivariate  normal  densities  was  considered  by  Cooper  [33]. 
Assuming  equal  mixing  proportions  for  simplicity,  he  explored 
both  the  two-component  case  involving  general  component 
covariance  matrices  and  the  multiple-component  case  for 
spherically  symmetric  component  densities.  Day  [36]  investigated 
moment  estimates  for  a mixture  of  two  multivariate  normal 
densities  with  a common  covariance  matrix.  Gumbel  [54]  derived 
moment  estimates  for  the  means  in  a mixture  of  two  exponential 
densities  under  the  assumption  that  the  mixing  proportions  are 
known. The results of [54] were extended by Rider [111] to
include estimates of unknown proportions as well as means.
Later, Rider offered moment estimates for mixtures of Weibull
distributions in [112].

Moment  estimates  for  a variety  of  simple  mixtures  of 
discrete  densities  were  derived  more  or  less  in  parallel  with 
moment  estimates  for  mixtures  of  normal  and  continuous  densities. 
Pearson  [98]  constructed  moment  estimates  for  a mixture  of  two 
binomial  densities  of  common  unknown  power  and  for  a mixture  of 
two  Poisson  densities.  Muench  [90]  published  simpler  estimates 
for  a mixture  of  two  binomial  densities  of  known  power;  in  [91], 
he  sketched  the  extension  of  the  results  of  [98]  and  [90]  to 
mixtures of any number of Poisson densities or binomial densities
of  common  known  power.  Later,  the  moment  estimates  for  a mixture 
of  two  Poisson  densities  were  independently  re-derived  by 
Schilling  [115].  In  the  case  of  known  mixing  proportions,  the 
moment  estimates  for  a mixture  of  two  Poisson  densities  were 
obtained  independently  by  Gumbel  [54]  and  Arley  and  Buch  [3]. 
Further  independent  reconstruction  and  extension  of  earlier  work 
was  done  by  Rider  [112]  and  Blischke  [11].  In  [112],  moment 
estimates  are  derived  for  mixtures  of  two  of  either  the  Poisson, 
binomial,  negative  binomial,  or  (as  mentioned  above)  Weibull 
densities.  In  a construction  paralleling  that  of  [112],  moment 
estimates are given in [11] for a mixture of two binomial
densities of common known power; in addition, properties of these
estimates such as their limiting distributions and asymptotic
relative efficiencies are considered. The results of Rider [112]
were simplified through the use of factorial rather than ordinary
moments  and  extended  to  include  certain  alternative  estimates  and 
additional  mixtures  by  Cohen  [29].  Following  the  outline  of 
Muench  [91],  Blischke  [13]  extended  the  results  of  [11]  to  give 
moment  estimates  for  a mixture  of  any  number  of  binomial 
densities  of  common  known  power.  For  additional  information  on 
moment  estimation  and  many  other  topics  of  interest  for  mixtures 
of  discrete  distributions,  we  refer  the  reader  to  the  extensive 
survey of Blischke [12].

Before  leaving  the  method  of  moments,  we  mention  the 
important  problem  of  estimating  the  proportions  alone  in  a 
mixture  density  under  the  assumption  that  the  component 
densities,  or  at  least  some  useful  statistics  associated  with 
them,  are  known.  Most  general  mixture  density  estimation 
procedures  can  be  brought  to  bear  on  this  problem,  and  the  manner 
of  applying  these  general  procedures  to  this  problem  is  usually 
independent  of  the  particular  forms  of  the  densities  in  the 
mixture.  In  addition  to  the  general  estimation  procedures,  a 
number  of  special  procedures  have  been  developed  for  this 
problem;  these  are  discussed  in  the  third  subsection  of  this 
review.  The  method  of  moments  has  the  attractive  property  for 
this  problem  that  the  moment  equations  are  linear  in  the  mixture 
proportions.  Moment  estimates  of  proportions  were  discussed  by 
Odell  and  Basu  [92].  The  sensitivity  of  moment  estimates  and 
other  proportion  estimates  to  changes  in  location  of  the 
component  densities  was  studied  by  Tubbs  and  Coberly  [127]. 
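
The linearity noted above is easiest to see in the simplest case. The following Python sketch, assuming two components with known and distinct means, solves the single first-moment equation for the proportion of the first component. The function name and the data are hypothetical, and nothing here is specific to the procedures of [92] or [127].

```python
def proportion_from_mean(xs, mu1, mu2):
    """Moment estimate of alpha_1 from the first-moment equation
    xbar = alpha_1 * mu1 + (1 - alpha_1) * mu2, which is linear in alpha_1.
    Requires mu1 != mu2; the result is not forced to lie in [0, 1]."""
    xbar = sum(xs) / len(xs)
    return (xbar - mu2) / (mu1 - mu2)

# Hypothetical sample: three observations near the first mean, one near the second.
alpha1_hat = proportion_from_mean([0.0, 0.0, 0.0, 10.0], mu1=0.0, mu2=10.0)
```

With more than two components, one would use additional moments together with the constraint that the proportions sum to one, giving a linear system of the same kind.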




2.2  The  method  of  maximum  likelihood. 

With  the  arrival  of  increasingly  powerful  computers  and 
increasingly sophisticated numerical methods during the 1960's,
investigators  began  to  turn  from  the  method  of  moments  to  the 
method  of  maximum  likelihood  as  the  most  widely  preferred 
approach  to  mixture  density  estimation  problems.  To  reiterate 
the working definition given in the introduction, we say that a
maximum-likelihood estimate associated with a sample of
observations is a choice of parameters which maximizes the
probability density function of the sample, called in this
context the likelihood function. In the next section, we define
precisely  the  maximum-likelihood  estimates  of  interest  here  and 
comment  on  their  properties.  In  this  subsection,  we  offer  a very 
brief  tour  of  the  literature  addressing  maximum-likelihood 
estimation  for  mixture  densities.  Of  course,  more  is  said  in  the 
sequel  about  most  of  the  work  mentioned  below. 
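
For an independent unlabeled sample, the working definition above can be made concrete in a few lines. The Python sketch below evaluates the logarithm of the likelihood function for a mixture of univariate normal densities at two candidate parameter values; a maximum-likelihood estimate is a choice of parameters at which this function is largest. The function names, the tiny sample, and the candidate values are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def log_likelihood(xs, alphas, phis):
    """Log of the likelihood of an independent sample: the sum over
    observations of log(sum_i alpha_i * p_i(x | phi_i)), phi_i = (mu_i, sigma2_i)."""
    return sum(
        math.log(sum(a * normal_pdf(x, mu, s2) for a, (mu, s2) in zip(alphas, phis)))
        for x in xs
    )

xs = [-2.1, -1.9, 2.0, 2.2]                        # hypothetical observations
separated = ((0.5, 0.5), ((-2.0, 1.0), (2.0, 1.0)))
coincident = ((0.5, 0.5), ((0.0, 1.0), (0.0, 1.0)))
ll_separated = log_likelihood(xs, *separated)
ll_coincident = log_likelihood(xs, *coincident)
```

Here the first candidate, whose component means sit near the two clusters of observations, gives the larger value and is therefore preferred under the likelihood criterion.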

Actually,  maximum-likelihood  estimates  and  their  associated 
efficiency  were  often  the  subject  of  wishful  thinking  prior  to 
the  advent  of  computers,  and  some  work  was  done  then  toward 
obtaining  maximum-likelihood  estimates  for  simple  mixtures. 
Specifically,  Baker  [4]  obtained  maximum-likelihood  estimates  of 
the  ratio  of  the  proportions  both  in  a mixture  of  two  essentially 
arbitrary  univariate  densities  for  samples  of  sizes  two  and  three 
and in a mixture of two univariate densities which are uniform
over intervals for arbitrary sample sizes. Also, Rao [107]
considered a mixture of two univariate normal densities with
equal variances and specified the likelihood equations, a system
of  four  equations  satisfied  by  the  four  unknown  parameters  at  the 
maximum-likelihood  estimate.  He  suggested  solving  the  likelihood 
equations  numerically  with  an  iterative  procedure  known  as  the 
method  of  scoring,  which  we  describe  in  Section  6.  Finally, 
Mendenhall  and  Hader  [89]  obtained  maximum-likelihood  estimates 
of  the  parameters  in  a mixture  of  two  exponential  densities  using 
a sample  in  which  some  of  the  observations  are  labeled.  They 
reduced  the  problem  of  obtaining  the  estimates  to  that  of  solving 
a single  nonlinear  equation  in  one  unknown;  a numerical  solution 
of this equation was found using Newton's method. Despite this
early  work,  however,  the  problem  of  obtaining  maximum-likelihood 
estimates  was  generally  considered  during  this  period  to  be 
completely  intractable  for  computational  reasons. 

As  computers  became  available  to  ease  the  burden  of 
computation,  maximum-likelihood  estimation  was  proposed  and 
studied  in  turn  for  a variety  of  increasingly  complex  mixture 
densities.  As  before,  mixtures  of  normal  densities  were  the 
subject  of  considerable  attention.  Hasselblad  [64]  treated 
maximum-likelihood  estimation  for  mixtures  of  any  number  of 
univariate  normal  densities;  his  major  results  were  later 
obtained  independently  by  Behboodian  [70].  Mixtures  of  two 
multivariate  normal  densities  with  a common  unknown  covariance 
matrix  were  addressed  by  Day  [36].  The  general  case  of  a mixture 
of  any  number  of  multivariate  normal  densities  was  considered  by 
Wolfe [132], and additional work on this case was done by Duda
and Hart [44] and Peters and Walker [101]. Tan and Chang [121]
compared the moment and maximum-likelihood estimates for a
mixture of two univariate normal densities with common variance
by computing the asymptotic variances of the estimates; they
found that maximum-likelihood estimates are much better,
especially when the component densities are poorly separated.
Hosmer [67] reported on a Monte Carlo study of maximum-likelihood
estimates for a mixture of two univariate normal densities when
the component densities are not well separated and the sample
size is small; the results of his study suggest that the method
of maximum-likelihood should be used with considerable caution in
such cases.

Several interesting variations on the usual estimation
problem for mixtures of normal densities have been addressed in
the literature. Hosmer [68] compared the maximum-likelihood
estimates for a mixture of two univariate normal densities
obtained from three different types of samples, the first of
which is the usual type consisting of only unlabeled observations
and the second two of which consist of both labeled and unlabeled
observations and are distinguished by whether or not the labeled
observations contain information about the mixing proportions.
(We elaborate on the nature of these samples and how they might
arise in Section 2.) Earlier, Tan and Chang [120] considered a
problem from genetics which is nearly identical to that
considered by Hosmer [68] for partially labeled samples which
contain no information about the mixing proportions. Also, Dick
and Bowden [42] independently addressed a special case of this
problem  in  which  maximum-likelihood  estimates  are  obtained  using 
a sample  of  labeled  observations  from  one  component  population 
together  with  a sample  of  unlabeled  observations  on  the  mixture. 
Finally,  a number  of  authors  have  investigated  maximum-likelihood 
estimates  for  a "switching  regression"  model  which  is  a certain 
type  of  estimation  problem  for  mixtures  of  normal  densities;  see 
the  papers  of  Quandt  [105],  Hosmer  [69],  Kiefer  [77],  and  the 
comments  by  Hartley  [63],  Hosmer  [70],  and  Kiefer  [78]  on  the 
paper  of  Quandt  and  Ramsey  [106].  A generalization  of  the  model 
considered by these authors was touched on by Dennis [39].

Maximum-likelihood  estimation  has  also  been  studied  for  a 
variety  of  unusual  and  general  mixture  density  problems,  some  of 
which  include  but  are  not  restricted  to  the  usual  normal  mixture 
problem.  Cohen  [30]  considered  an  unusual  but  simple  mixture  of 
two  discrete  densities,  one  of  which  has  support  at  a single 
point;  he  focused  in  particular  on  the  case  in  which  the  other 
density  is  a negative  binomial  density.  Hasselblad  [65] 
generalized  his  earlier  results  in  [64]  to  include  mixtures  of 
any  number  of  univariate  densities  from  exponential  families.  He 
included a short study comparing maximum-likelihood estimates
with  the  moment  estimates  of  Blischke  [13]  for  a mixture  of  two 
binomial  distributions.  Baum,  Petrie,  Soules,  and  Weiss  [7] 
addressed  a mixture  estimation  problem  which  is  both  unusual  and 
in  one  respect  more  general  than  the  problems  considered  in  the 
sequel. In their problem, the a priori probabilities of sample
observations coming from the various component populations in the
mixture  are  not  independent  from  one  observation  to  the  next 
(that  is,  they  are  not  simply  the  proportions  of  the  component 
populations  in  the  mixture)  but  rather  are  specified  to  follow  a 
Markov  chain.  Their  results  are  specifically  applied  to  mixtures 
of  univariate  normal,  gamma,  binomial,  and  Poisson  densities  and 
to  mixtures  of  general  strictly  log  concave  density  functions 
which  are  identical  except  for  unknown  location  and  scale 
parameters.  Peters  and  Coberly  [100]  and  Peters  and  Walker  [102] 
treated  maximum-likelihood  estimates  of  proportions  and  subsets 
of  proportions  for  essentially  arbitrary  mixture  densities. 
Maximum-likelihood  estimates  were  included  by  Tubbs  and  Coberly 
[127]  in  their  study  of  the  sensitivity  of  various  proportion 
estimators.  Other  maximum-likelihood  estimation  problems  which 
are  closely  related  to  those  considered  here  are  the  latent 
structure  problems  touched  on  by  Wolfe  [132]  (see  also  Lazarsfeld 
and  Henry  [81])  and  the  problems  concerning  frequency  tables 
derived  by  indirect  observation  addressed  by  Haberman  [57],  [58], 
[59].  Finally,  although  infinite  mixture  densities  of  the 
general  form  (1.2)  are  specifically  excluded  from  consideration 
here,  we  mention  a very  interesting  result  of  Laird  [80]  to  the 
effect  that  under  various  assumptions,  the  maximum-likelihood 
estimate  of  a possibly  infinite  mixture  density  is  actually  a 
finite  mixture  density. 




2.3  Other  methods. 

In  addition  to  the  method  of  moments  and  the  method  of 
maximum-likelihood,  a variety  of  other  methods  have  been  proposed 
for  estimating  parameters  in  mixture  densities.  Some  of  these 
methods  are  general  purpose  methods.  Others  are  (or  were  at  the 
time  of  their  derivation)  intended  for  mixture  problems  the  forms 
of  which  make  (or  made)  them  either  ill-suited  for  the 
application  of  more  widely  used  methods  or  particularly  well- 
suited  for  the  application  of  special  purpose  methods. 

For mixtures of any number of univariate normal densities,
Harding [61] and Cassie [19] suggested graphical procedures
employing  probability  paper  as  an  alternative  to  moment 
estimates,  which  were  at  that  time  practically  unobtainable  in 
all  but  the  simplest  cases.  Later,  Bhattacharya  [10]  prescribed 
other  graphical  methods  as  a particularly  simple  way  of  resolving 
a mixture  density  into  normal  components.  These  graphical 
procedures  work  best  on  mixture  populations  which  are  well- 
separated  in  the  sense  that  each  component  has  an  associated 
region  in  which  the  presence  of  the  other  components  can  be 
ignored.

Also  for  general  mixtures  of  univariate  normal  densities, 
Doetsch  [43]  exhibited  a linear  operator  which  reduces  the 
variances  of  the  component  densities  without  changing  their 
proportions or means and used this operator in a procedure which
determines the component densities one at a time. Medgyessy [88]
(see also the review by Mallows [86]) extended the techniques of
[43] to a large class of univariate mixture densities subject to
the restriction that each component density have no more than two
unknown parameters. Gregor [52] prescribed an algorithm for
implementing the methods of Doetsch [43] and Medgyessy [88] on a
mixture  of  univariate  normal  densities.  Stanat  [117]  broadened 
the  methods  of  [43]  and  [88]  to  study  mixtures  of  multivariate 
normal  and  Bernoulli  densities.  In  [114],  Sammon  considered  a 
mixture  density  consisting  of  an  unknown  number  of  component 
densities  which  are  identical  except  for  translation  by  unknown 
location  parameters;  he  derived  techniques  based  on  convolution 
for  estimating  both  the  number  of  components  in  the  mixture  and 
the  location  parameters. 

A number  of  specialized  procedures  have  been  developed  for 
application  to  the  problem  of  estimating  the  proportions  in  a 
mixture  under  the  assumption  that  something  about  the  component 
densities  is  known.  Choi  and  Bulgren  [28]  proposed  an  estimate 
determined  by  a least-squares  criterion  in  the  spirit  of  the 
minimum-distance  method  of  Wolfowitz  [133].  A variant  of  the 
method of [28] for which smaller bias and mean-square error were
reported was offered by Macdonald [84]. A method termed the
confusion matrix method was given by Odell and Chhikara [93] (see
also the review of Odell and Basu [92]). In this method, an
estimate is obtained by subdividing $\mathbb{R}^n$ into disjoint regions
$R_1, \ldots, R_m$ and then solving the equation $P\alpha = e$, in which $\alpha$ is
the estimated vector of proportions, $e$ is a vector whose $i$th
component is the fraction of observations falling in $R_i$, and
the "confusion matrix" $P$ has $ij$th entry

$$P_{ij} = \int_{R_i} p_j(x \mid \phi_j)\, dx .$$

The  confusion  matrix  method  is  a special  case  of  a method  of 
Macdonald  [85],  whose  formulation  of  the  problem  as  a least- 
squares  problem  allows  for  a singular  or  rectangular  confusion 
matrix.  Earlier,  special  cases  of  estimates  of  this  type  were 
considered  by  Boes  [14],  [15].  Guseman  and  Walton  [55],  [56] 
employed  certain  pattern  recognition  notions  and  techniques  to 
obtain  numerically  tractable  confusion  matrix  proportion 
estimates  for  mixtures  of  multivariate  normal  densities.  James 
[73]  studied  several  simple  confusion  matrix  proportion  estimates 
for  a mixture  of  two  univariate  normal  densities.  Ganesalingam 
and  McLachlan  [49]  compared  the  performance  of  confusion  matrix 
proportion  estimates  with  maximum-likelihood  proportion  estimates 
for  a mixture  of  two  multivariate  normal  densities.  Finally,  we 
mention that Walker [130] considered a mixture of two essentially
arbitrary multivariate densities and, assuming only that the
means of the component densities are known, suggested a simple
procedure using linear maps which yields unbiased proportion
estimates.
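
The confusion matrix method described in this paragraph can be sketched in Python for the simplest setting: two known univariate normal components and two regions separated by a threshold t. The choice of regions, the use of Cramer's rule for the 2 x 2 system, and all names and data below are assumptions of this illustration, not details taken from [93] or [85].

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def confusion_matrix(t, comps):
    """P[i][j] = probability that an observation from component j falls in
    region i, with R_1 = (-inf, t] and R_2 = (t, inf); comps holds (mu, sigma)."""
    below = [normal_cdf(t, mu, sigma) for mu, sigma in comps]
    return [below, [1.0 - b for b in below]]

def estimate_proportions(xs, t, comps):
    """Solve P * alpha = e for two components by Cramer's rule."""
    P = confusion_matrix(t, comps)
    frac_below = sum(1 for x in xs if x <= t) / len(xs)
    e = [frac_below, 1.0 - frac_below]
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]   # nonzero when the components differ at t
    a1 = (e[0] * P[1][1] - P[0][1] * e[1]) / det
    a2 = (P[0][0] * e[1] - P[1][0] * e[0]) / det
    return a1, a2

# Hypothetical components and sample.
comps = [(0.0, 1.0), (4.0, 1.0)]
a1, a2 = estimate_proportions([0.0, 1.0, 3.0, 4.0], t=2.0, comps=comps)
```

Because each column of P sums to one and the entries of e sum to one, the two estimated proportions always sum to one, although they are not guaranteed to lie between zero and one.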

A stochastic  approximation  algorithm  for  estimating  the 
parameters in a mixture of any number of univariate normal
densities was offered by Young and Coraluppi [139]. In such an
algorithm, one determines a sequence of recursively updated
estimates from a sequence of observations of indeterminate length
considered on a one-at-a-time or few-at-a-time basis. Such an
algorithm is likely to be appealing when a sample of desired size
is either unavailable in toto at any one point in time or
unwieldy because of its size. Stochastic approximation of
mixture proportions alone was considered by Kazakos [76].

Quandt  and  Ramsey  [106]  derived  a procedure  called  the 
moment  generating  function  method  and  applied  it  to  the  problem 
of  estimating  the  parameters  in  a mixture  of  two  univariate 
normal  densities  and  in  a switching  regression  model.  In  brief, 
a moment  generating  function  estimate  is  a choice  of  parameters 
which  minimizes  a certain  sum  of  squares  of  differences  between 
the  theoretical  and  sample  moment  generating  functions.  In  a 
comment  by  Kiefer  [78],  it  is  pointed  out  that  the  moment 
generating function method can be regarded as a natural
generalization of the method of moments. Kiefer [78] further
offers an appealing heuristic explanation of the apparent
superiority  of  moment  generating  function  estimates  over  moment 
estimates  reported  by  Quandt  and  Ramsey  [106].  In  a comment  by 
Hosmer  [70],  evidence  is  presented  that  moment  generating 
function  estimates  may  in  fact  perform  better  than  maximum- 
likelihood  estimates  in  the  small-sample  case.  The  moment 
generating  function  method  appears  to  be  a potentially  valuable 
tool  in  mixture  density  estimation  problems. 
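
A criterion of the kind described above can be sketched in Python for a mixture of two univariate normal densities. The unweighted sum of squares, the particular evaluation points, and the function and parameter names are assumptions of this sketch and are not claimed to match the exact formulation of [106].

```python
import math

def sample_mgf(xs, theta):
    """Sample moment generating function: the average of exp(theta * x)."""
    return sum(math.exp(theta * x) for x in xs) / len(xs)

def mixture_mgf(theta, alpha, mu1, sigma2_1, mu2, sigma2_2):
    """Moment generating function of a two-component normal mixture; a normal
    density contributes exp(mu * theta + sigma2 * theta**2 / 2)."""
    g1 = math.exp(mu1 * theta + 0.5 * sigma2_1 * theta ** 2)
    g2 = math.exp(mu2 * theta + 0.5 * sigma2_2 * theta ** 2)
    return alpha * g1 + (1.0 - alpha) * g2

def mgf_objective(params, xs, thetas):
    """Sum of squared differences between sample and theoretical moment
    generating functions at the chosen points thetas."""
    return sum((sample_mgf(xs, t) - mixture_mgf(t, *params)) ** 2 for t in thetas)

# Hypothetical data, evaluation points, and trial parameters.
xs = [-1.2, -0.8, 0.9, 1.1, 1.4]
thetas = [-0.2, -0.1, 0.1, 0.2, 0.3]
value = mgf_objective((0.4, -1.0, 0.5, 1.0, 0.5), xs, thetas)
```

An estimate would be obtained by minimizing `mgf_objective` over the five parameters with a numerical optimizer; at least as many evaluation points as parameters are needed for the criterion to be informative.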

Minimum  chi-square  estimation  is  a general  method  of 
estimation which has been touched on by a number of authors in
connection with the mixture density estimation problem but which
has not become the subject of much consideration in depth in this
context. In minimum chi-square estimation, one subdivides $\mathbb{R}^n$
into cells $R_1, \ldots, R_k$ and seeks a choice of parameters which
minimizes

$$\chi^2(\Phi) = \sum_{j=1}^{k} \frac{[N_j - E_j(\Phi)]^2}{E_j(\Phi)}$$

or some similar criterion function. In this expression, $N_j$ and
$E_j(\Phi)$ are, respectively, the observed and expected numbers of
observations in $R_j$ for $j = 1, \ldots, k$. For mixtures of normal
densities, minimum chi-square estimates were mentioned by
Hasselblad  [64],  Cohen  [31],  Day  [36],  and  Fryer  and  Robertson 
[47].  Minimum  chi-square  estimates  of  proportions  were  reviewed 
by  Odell  and  Basu  [92]  and  included  in  the  sensitivity  study  of 
Tubbs  and  Coberly  [127].  Macdonald  [85]  remarked  that  his 
weighted  least-squares  approach  to  proportion  estimation 
suggested  a convenient  iterative  method  for  computing  minimum 
chi-square  estimates. 
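
As a rough illustration of minimum chi-square estimation of a proportion, the sketch below bins univariate observations into cells defined by a list of cut points and compares observed with expected cell counts under a two-component normal mixture. The cell boundaries, the crude grid search, and all function names are assumptions chosen for the example; they do not reproduce the procedure of any of the authors cited above.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def expected_counts(n, edges, alphas, mus, sigmas):
    """Expected counts in the cells (-inf, e0], (e0, e1], ..., (e_last, inf)."""
    cdf = [0.0]
    for e in edges:
        cdf.append(sum(a * normal_cdf(e, m, s)
                       for a, m, s in zip(alphas, mus, sigmas)))
    cdf.append(1.0)
    return [n * (cdf[j + 1] - cdf[j]) for j in range(len(cdf) - 1)]


def observed_counts(xs, edges):
    counts = [0] * (len(edges) + 1)
    for x in xs:
        counts[sum(1 for e in edges if x > e)] += 1
    return counts


def chi_square(xs, edges, alphas, mus, sigmas):
    obs = observed_counts(xs, edges)
    exp = expected_counts(len(xs), edges, alphas, mus, sigmas)
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


def fit_alpha_by_grid(xs, edges, mus, sigmas, steps=99):
    """Grid search for the first proportion with component parameters fixed."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda a: chi_square(xs, edges, [a, 1.0 - a],
                                              mus, sigmas))


def draw_two_normals(n, alpha, mu1, mu2, seed=0):
    """Sample from alpha * N(mu1, 1) + (1 - alpha) * N(mu2, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(mu1, 1.0) if rng.random() < alpha else rng.gauss(mu2, 1.0)
            for _ in range(n)]
```

In practice the grid search would be replaced by an iterative minimizer, and the cells would be chosen so that no expected count is too small.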


As  a final  note,  we  mention  three  methods  which  have  been 
proposed  for  general  mixture  density  estimation  problems.  Choi 
[27]  discussed  the  extension  to  general  mixture  density 
estimation  problems  of  the  least-squares  method  of  Choi  and 
Bulgren  [28]  for  estimating  proportions.  Deely  and  Kruse  [37] 
suggested  an  estimation  procedure  which  is  in  spirit  like  that  of 
Choi and Bulgren [28] and Choi [27], except that a sup-norm
distance is used in place of the square integral norm. Deely and
Kruse  argued  that  their  procedure  is  computationally  feasible, 
but  no  concrete  examples  or  computation  results  are  given  in 
[37].  Yakowitz  [135],  [136]  outlined  a very  general  "algorithm" 
for  constructing  consistent  estimates  of  the  parameters  in 
mixture  densities  which  are  identifiable  in  the  sense  described 
in  the  fifth  subsection  of  this  review.  The  sense  in  which  his 
“algorithm”  is  really  an  algorithm  in  the  usually  understood 
sense  of  the  word  is  discussed  in  [136]. 


2.4 The EM algorithm.


At  several  points  in  the  review  above,  we  have  alluded  to 
computational  difficulties  associated  with  obtaining  maximum- 
likelihood  estimates.  For  mixture  density  problems,  these 
difficulties  arise  because  of  the  complex  dependence  of  the 
likelihood  function  on  the  parameters  to  be  estimated.  The 
customary way of finding a maximum-likelihood estimate is first
to determine a system of equations called the likelihood
equations  which  are  satisfied  by  the  maximum-likelihood  estimate 
and  then  to  attempt  to  find  the  maximum-likelihood  estimate  by 
solving  these  likelihood  equations.  The  likelihood  equations  are 
usually  found  by  differentiating  the  logarithm  of  the  likelihood 
function,  setting  the  derivatives  equal  to  zero,  and  perhaps 
performing  some  additional  algebraic  manipulations.  For  mixture 
density  problems,  the  likelihood  equations  are  almost  certain  to 
be  nonlinear  and  beyond  hope  of  solution  by  analytic  means. 
Consequently,  one  must  resort  to  seeking  an  approximate  solution 
via  some  iterative  procedure. 

There  are,  of  course,  many  general  iterative  procedures 
which  are  suitable  for  finding  an  approximate  solution  of  the 
likelihood  equations  and  which  have  been  honed  to  a high  degree 
of  sophistication  within  the  optimization  community.  We  have  in 
mind  here  principally  Newton's  method  and  various  quasi-Newton 
methods  which  are  variants  of  it.  In  fact,  the  method  of 
scoring,  which  was  mentioned  above  in  connection  with  the  work  of 
Rao [107] and which we describe in detail in the sequel, falls
into the category of Newton-like methods and is one such method
which is specifically formulated for solving likelihood
equations.

Our  main  interest  here,  however,  is  in  a special  iterative 
method  which  is  unrelated  to  Newton's  method  and  which  has  been 
applied  to  a wide  variety  of  mixture  problems  over  the  last 
fifteen  or  so  years.  Following  the  terminology  of  Dempster, 
Laird,  and  Rubin  [38],  we  call  this  method  the  EM  algorithm  (E 
for  "expectation"  and  M for  "maximization").  As  we  mentioned  in 
the introduction, it has been found in most instances to have the
advantages of reliable global convergence, low cost per iteration,
economy of storage, and ease of programming, as well as a certain
heuristic appeal; unfortunately, its convergence can be
maddeningly  slow  in  simple  problems  which  are  often  encountered 
in  practice. 

The  EM  algorithm  has  been  derived  and  studied  from  at  least 
two  distinct  viewpoints  by  a number  of  authors,  many  of  them 
working  independently.  Hasselblad  [64]  obtained  the  EM  algorithm 
for  an  arbitrary  finite  mixture  of  univariate  normal  densities 
and  made  empirical  observations  about  its  behavior.  In  an 
extension  of  [64],  he  further  prescribed  the  algorithm  for 
essentially  arbitrary  finite  mixtures  of  univariate  densities 
from  exponential  families  in  [65].  The  EM  algorithm  of  [64]  for 
univariate  normal  mixtures  was  given  again  by  Behboodian  [8], 
while Day [36] and Wolfe [132] formulated it for, respectively,
mixtures of two multivariate normal densities with common
covariance matrix and arbitrary finite mixtures of multivariate
normal densities. All of these authors apparently obtained the
EM algorithm independently, although Wolfe [132] referred to
Hasselblad [64]. They all derived the algorithm by setting the
partial  derivatives  of  the  log-likelihood  function  equal  to  zero, 
and  after  some  algebraic  manipulation,  obtained  equations  which 
suggest  the  algorithm. 
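
For a finite mixture of univariate normal densities, the iteration suggested by these equations takes a familiar form: compute, for each observation, the posterior probability of each component under the current parameters, then re-estimate proportions, means, and variances as the correspondingly weighted averages. The sketch below is a minimal version under that reading; the function names are illustrative, and no safeguard against a variance collapsing to zero is included.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, alphas, mus, variances):
    """Log-likelihood of an unlabeled sample under the mixture."""
    return sum(
        math.log(sum(a * normal_pdf(x, m, v)
                     for a, m, v in zip(alphas, mus, variances)))
        for x in xs
    )


def em_step(xs, alphas, mus, variances):
    """One EM iteration for a univariate normal mixture."""
    n, m = len(xs), len(alphas)
    # Posterior probability that observation x came from component i.
    resp = []
    for x in xs:
        w = [a * normal_pdf(x, mu, v)
             for a, mu, v in zip(alphas, mus, variances)]
        total = sum(w)
        resp.append([wi / total for wi in w])
    # Weighted re-estimation of proportions, means, and variances.
    new_alphas, new_mus, new_vars = [], [], []
    for i in range(m):
        ri = [r[i] for r in resp]
        si = sum(ri)
        mu = sum(r * x for r, x in zip(ri, xs)) / si
        var = sum(r * (x - mu) ** 2 for r, x in zip(ri, xs)) / si
        new_alphas.append(si / n)
        new_mus.append(mu)
        new_vars.append(var)
    return new_alphas, new_mus, new_vars
```

Repeated calls to `em_step` should not decrease `log_likelihood`, which is a convenient check when experimenting with starting values.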

Following  these  early  derivations,  the  EM  algorithm  was 
applied  by  Tan  and  Chang  [120]  to  a mixture  problem  in  genetics 
and  used  by  Hosmer  [67]  in  the  Monte  Carlo  study  of  maximum- 
likelihood  estimates  referred  to  earlier.  Duda  and  Hart  [44] 
cited  the  EM  algorithm  for  mixtures  of  multivariate  normal 
densities  and  commented  on  its  behavior  in  practice.  Hosmer  [68] 
extended  the  EM  algorithm  for  mixtures  of  two  univariate  normal 
densities  to  include  the  partially  labeled  samples  described 
briefly  above.  Hartley  [63]  prescribed  the  EM  algorithm  for  a 
"switching  regression"  model.  Peters  and  Walker  [101]  offered  a 
local  convergence  analysis  of  the  EM  algorithm  for  mixtures  of 
multivariate normal densities and suggested modifications of the
algorithm  to  accelerate  convergence.  Peters  and  Coberly  [100] 
studied  the  EM  algorithm  for  approximating  maximum-likelihood 
estimates  of  the  proportions  in  an  essentially  arbitrary  mixture 
density  and  gave  a local  convergence  analysis  of  the  algorithm. 
Peters  and  Walker  [102]  extended  the  results  of  [100]  to  include 
subsets  of  mixture  proportions  and  a local  convergence  analysis 
along  the  lines  of  [101]. 


All  of  the  above  investigators  regarded  the  EM  algorithm  as 
arising  naturally  from  the  particular  forms  taken  by  the  partial 
derivatives of the log-likelihood function. A quite different
point  of  view  toward  the  algorithm  was  put  forth  by  Dempster, 
Laird,  and  Rubin  [38].  They  interpreted  the  mixture  density 
estimation  problem  as  an  estimation  problem  involving  incomplete 
data  by  regarding  an  unlabeled  observation  on  the  mixture  as 
"missing"  a label  indicating  its  component  population  of  origin. 
In  doing  so,  they  not  only  related  the  mixture  density  problem  to 
a broader class of statistical problems but also showed that the
EM algorithm for mixture density problems is really a
specialization of a more general algorithm (also called the EM
algorithm in [38]) for approximating maximum-likelihood estimates
from incomplete data. As one sees in the sequel, this more
general EM algorithm is defined in such a way that it has certain
desirable theoretical properties by its very definition.
Earlier, the EM algorithm was defined independently in a very
similar manner by Baum et al. [7] for very general mixture density
estimation  problems  and  by  Haberman  [57],  [58],  [59]  for 
mixture-related  problems  involving  frequency  tables  derived  by 
indirect  observation.  Haberman  also  refers  in  [59]  to  versions 
of  his  algorithm  developed  by  Ceppellini,  Siniscalco,  and  Smith 
[20],  Chen  [26],  and  Goodman  [50].  In  addition,  an 
interpretation of mixture problems as incomplete data problems
was given in the brief discussion of mixtures by Orchard and
Woodbury [94]. The desirable theoretical properties
automatically enjoyed by the EM algorithm suggest in turn the
good global convergence behavior of the algorithm which has been
observed  in  practice  by  many  investigators.  Theorems  which 
essentially  confirm  this  suggested  behavior  have  been  recently 
obtained  by  Redner  [109],  Boyles  [17],  and  Wu  [134]  and  are 
outlined  in  the  sequel. 


2.5 Identifiability and information.




To  complete  this  review,  we  touch  on  two  topics  which  have 
to  do  with  the  general  well-posedness  of  estimation  problems 
rather  than  with  any  particular  method  of  estimation.  The  first 
topic, identifiability, addresses the theoretical question of
whether  it  is  possible  to  uniquely  estimate  a parameter  from  a 
sample,  however  large.  The  second  topic,  information,  relates  to 
the  practical  matter  of  how  good  one  can  reasonably  hope  for  an 
estimate  to  be.  A thorough  survey  of  these  topics  is  far  beyond 
the scope of this review; we try to cover below those aspects of
them  which  have  a specific  bearing  on  the  sequel. 

In  general,  a parametric  family  of  probability  density 
functions is said to be identifiable if distinct parameter values
determine  distinct  members  of  the  family.  For  families  of 
mixture  densities,  this  general  definition  requires  a special 
interpretation. For the purposes of this paper, let us first say
that a mixture density $p(x|\Phi)$ of the form (1.1) is economically
represented if, for each pair of integers $i$ and $j$ between 1
and $m$, one has that $p_i(x|\phi_i) = p_j(x|\phi_j)$ for almost all
$x \in R^n$ (relative to the underlying measure on $R^n$ appropriate
for $p(x|\Phi)$) only if either $i = j$ or one of $\alpha_i$ and
$\alpha_j$ is zero. Then it suffices to say that a family of mixture
densities of the form (1.1) is identifiable for $\Phi \in \Omega$ if
for each pair
$\Phi' = (\alpha'_1, \ldots, \alpha'_m, \phi'_1, \ldots, \phi'_m)$ and
$\Phi'' = (\alpha''_1, \ldots, \alpha''_m, \phi''_1, \ldots, \phi''_m)$ in
$\Omega$ determining economically represented densities $p(x|\Phi')$
and $p(x|\Phi'')$, one has that $p(x|\Phi') = p(x|\Phi'')$ for almost
all $x \in R^n$ only if there is a permutation $\pi$ of
$(1, \ldots, m)$ such that $\alpha'_i = \alpha''_{\pi(i)}$ and, if
$\alpha'_i \neq 0$, $\phi'_i = \phi''_{\pi(i)}$ for $i = 1, \ldots, m$.
For a more general definition suitable for possibly infinite
mixture densities of the form (1.2), see, for example, Yakowitz
and Spragins [137].

It  is  tacitly  assumed  here  that  all  families  of  mixture 
densities under consideration are identifiable. One can easily
determine the identifiability of specific mixture densities
using, for example, the identifiability characterization theorem
of Yakowitz and Spragins [137]. For more on identifiability of
mixture densities, the reader is referred to the papers of
Teicher [123], [124], [125], [126], Barndorff-Nielsen [5],
Yakowitz and Spragins [137], and Yakowitz [135], [136] and to the
book by Maritz [70].


The Fisher information matrix is given by

\[ I(\Phi) = \int_{R^n} [\nabla_\Phi \log p(x|\Phi)] [\nabla_\Phi \log p(x|\Phi)]^T \, p(x|\Phi) \, d\mu , \qquad (2.5.1) \]

provided that $p(x|\Phi)$ is such that this expression is
well-defined. (In writing $\nabla_\Phi$, we suppose that one can
conveniently redefine $\Phi$ as a vector
$\Phi = (\xi_1, \ldots, \xi_\nu)^T$ of unconstrained scalar
parameters, and we take
$\nabla_\Phi = (\partial/\partial\xi_1, \ldots, \partial/\partial\xi_\nu)^T$.
Also, in (2.5.1), $\mu$ denotes the underlying measure on $R^n$
appropriate for $p(x|\Phi)$.)


The  Fisher  information  matrix  has  general  significance  concerning 
the  distribution  of  unbiased  and  asymptotically  unbiased 
estimates. For the present purposes, the importance of the
Fisher information matrix lies in its role in determining the
asymptotic  distribution  of  maximum-likelihood  estimates  (see 
Theorem  3.1  below). 
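
In the simplest case, where the only unknown is the proportion $\alpha$ in a mixture $\alpha p_1 + (1 - \alpha) p_2$ of two fully specified univariate densities, (2.5.1) reduces to the scalar integral of $(p_1 - p_2)^2 / p$. The sketch below evaluates it by a midpoint rule for two normal densities with a common standard deviation. The integration limits, step count, and function names are assumptions for illustration and are unrelated to the series approximations or tables of the authors cited in this subsection.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fisher_info_proportion(alpha, mu1, mu2, sigma=1.0, steps=20000):
    """Fisher information for alpha in alpha*N(mu1, s^2) + (1-alpha)*N(mu2, s^2).

    Uses d/d(alpha) log p = (p1 - p2) / p, so the information is the
    integral of (p1 - p2)^2 / p, approximated here by a midpoint rule.
    """
    lo = min(mu1, mu2) - 10.0 * sigma
    hi = max(mu1, mu2) + 10.0 * sigma
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        p1 = normal_pdf(x, mu1, sigma)
        p2 = normal_pdf(x, mu2, sigma)
        p = alpha * p1 + (1.0 - alpha) * p2
        if p > 0.0:
            total += (p1 - p2) ** 2 / p * h
    return total
```

When the two components coincide the information is zero, and as they separate it approaches $1/[\alpha(1 - \alpha)]$, the value for a completely labeled sample.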

A number  of  authors  have  considered  the  Fisher  information 
matrix  for  finite  mixture  densities  in  a variety  of  contexts.  We 
mention  in  particular  several  investigations  in  which  the  Fisher 
information  matrix  is  of  central  interest.  (There  have  been 
others in which the Fisher information matrix or some
approximation of it has played a significant but less prominent
role;  see  those  of  Mendenhall  and  Hader  [89],  Hasselblad  [64], 
[65],  Day  [36],  Wolfe  [132],  Dick  and  Bowden  [42],  Hosmer  [67], 
James  [73],  and  Ganesalingam  and  McLachlan  [49].)  Hill  [66] 
exploited  simple  approximations  obtained  in  limiting  cases  from  a 
general  power  series  expansion  to  investigate  the  Fisher 
information  for  estimating  the  proportion  in  a mixture  of  two 
normal  or  exponential  densities.  Behboodian  [9]  offered  methods 
for  computing  the  Fisher  information  matrix  for  the  proportion, 
means,  and  variances  in  a mixture  of  two  univariate  normal 
densities;  he  also  provided  four-place  tables  from  which 
approximate  information  matrices  for  a variety  of  parameter 
values  can  be  easily  obtained.  In  their  comparison  of  moment  and 
maximum-likelihood  estimates,  Tan  and  Chang  [121]  numerically 
evaluated  the  diagonal  elements  of  the  inverse  of  the  Fisher 
information matrix at a variety of parameter values for a mixture
of two univariate normal densities with a common variance. Using
the Fisher information matrix, Chang [22] investigated the
effects of adding a second variable on the asymptotic
distribution  of  the  maximum-likelihood  estimates  of  the 
proportion  and  parameters  associated  with  the  first  variable  in  a 
mixture  of  two  normal  densities.  Later,  Chang  [23]  extended  the 
methods  of  [22]  to  include  mixtures  of  two  normal  densities  on 
variables  of  arbitrary  dimension.  For  a mixture  of  two 
univariate  normal  densities,  Hosmer  and  Dick  [71]  considered 
Fisher  information  matrices  determined  by  a number  of  sample 
types.  They  compared  the  asymptotic  relative  efficiencies  of 
estimates  from  totally  unlabeled  samples,  estimates  from  two 
types  of  partially  labeled  samples,  and  estimates  from  two  types 
of completely labeled samples.


3. Maximum-likelihood

In this section, maximum-likelihood estimates for mixture
densities are defined precisely, and their important properties
are discussed. It is assumed that a parametric family of mixture
densities of the form (1.1) is specified and that a particular
$\Phi^* = (\alpha_1^*, \ldots, \alpha_m^*, \phi_1^*, \ldots, \phi_m^*) \in \Omega$
is the "true" parameter value to be estimated. As before, it is
both natural and convenient to regard $p(x|\Phi)$ in (1.1) as
modeling a statistical population which is a mixture of $m$
component populations with associated component densities
$\{p_i\}_{i=1,\ldots,m}$ and mixing proportions
$\{\alpha_i\}_{i=1,\ldots,m}$.

In  order  to  suggest  to  the  reader  the  variety  of  samples 
which  might  arise  in  mixture  problems  as  well  as  to  provide  a 
framework  within  which  to  discuss  samples  of  interest  in  the 
sequel,  we  introduce  samples  of  observations  in  Rn  of  four 
distinct  types.  All  of  the  mixture  density  estimation  problems 
which  we  have  encountered  in  the  literature  involve  samples  which 
are  expressible  as  one  or  a stochastically  independent  union  of 
samples  of  these  types,  although  the  imaginative  reader  can 
probably  think  of  samples  for  mixture  problems  which  can  not  be 
so  represented.  The  four  types  of  samples  and  the  notation  which 
we  associate  with  them  are  given  as  follows: 

Type 1. Suppose that $\{x_k\}_{k=1,\ldots,N}$ is an independent
sample of $N$ unlabeled observations on the mixture, i.e., a
set of $N$ observations on independent, identically
distributed random variables with density $p(x|\Phi^*)$. Then
$S_1 = \{x_k\}_{k=1,\ldots,N}$ is a sample of Type 1.

Type 2. Suppose that $J_1, \ldots, J_m$ are arbitrary non-negative
integers and that for $i = 1, \ldots, m$,
$\{y_{ik}\}_{k=1,\ldots,J_i}$ is an independent sample of
observations on the $i$th component population, i.e., a set of
$J_i$ observations on independent, identically distributed
random variables with density $p_i(x|\phi_i^*)$. Then
$S_2 = \bigcup_{i=1}^{m} \{y_{ik}\}_{k=1,\ldots,J_i}$ is a sample of
Type 2.

Type 3. Suppose that an independent sample of $K$ unlabeled
observations is drawn on the mixture, that these
observations are subsequently labeled, and that for
$i = 1, \ldots, m$, a set $\{z_{ik}\}_{k=1,\ldots,K_i}$ of them is
associated with the $i$th component population with
$K = \sum_{i=1}^{m} K_i$. Then
$S_3 = \bigcup_{i=1}^{m} \{z_{ik}\}_{k=1,\ldots,K_i}$ is a sample of
Type 3.

Type 4. Suppose that an independent sample of $M$ unlabeled
observations is drawn on the mixture, that the observations
in the sample which fall in some set $E \subseteq R^n$ are
subsequently labeled, and that for $i = 1, \ldots, m$, a set
$\{w_{ik}\}_{k=1,\ldots,M_i}$ of them is thereby associated with
the $i$th component population while a set
$\{w_{0k}\}_{k=1,\ldots,M_0}$ remains unlabeled. Then
$S_4 = \bigcup_{i=0}^{m} \{w_{ik}\}_{k=1,\ldots,M_i}$ is a sample of
Type 4.

A totally unlabeled sample $S_1$ of Type 1 is the sort of
sample considered in almost all of the literature on mixture
densities. Throughout most of the sequel, it is assumed as a
convenience that samples under consideration are of this type.
The major qualitative difference between completely labeled
samples $S_2$ and $S_3$ of Types 2 and 3, respectively, is that the
numbers $K_i$ contain information about the mixing proportions
while the numbers $J_i$ do not. Thus if estimation of proportions
is of interest, then a sample $S_2$ is useful only as a subset of
a larger sample which includes samples of other types. For
mixtures of two univariate densities, Hosmer [68] considered
samples of the forms $S_1 \cup S_2$ and $S_1 \cup S_3$. Previously
Tan and Chang [120] considered a problem involving an application
of mixtures in explaining genetic variation which is almost
identical to that of [68] in which the sample is of the form
$S_1 \cup S_2$. Also Dick and Bowden [42] used a sample of the form
$S_1 \cup S_2$ in which $m = 2$ and $J_2 = 0$. Hosmer and Dick [71]
evaluated the Fisher information matrix for a variety of samples
of Types 1, 2, 3, and their unions.

A sample of Type 4 is likely to be associated with a
mixture problem involving censored sampling. While the numbers
$M_i$ contain information about the mixing proportions, as do the
numbers $K_i$ of a sample $S_3$ of Type 3, they also contain
information about the parameters of the component densities while
the numbers $K_i$ do not. An interesting and informative example
of how a sample of Type 4 might arise is the following, which is
in the area of life testing and is outlined by Mendenhall and
Hader [89].

Example.  In  life  testing,  one  is  interested  in  testing 
"products"  (systems,  devices,  etc.),  recording  failure  times  or 
causes,  and  hopefully  thereby  being  better  able  to  understand  and 
improve  the  performance  of  the  product.  It  often  happens  that 
products  of  a particular  type  fail  as  a result  of  two  or  more 
distinct  causes.  (An  example  of  Acheson  and  McElwee  [1]  is 
quoted  in  [89]  in  which  the  causes  of  electronic  tube  failure  are 
divided  into  gaseous  defects,  mechanical  defects,  and  normal 
deterioration  of  the  cathode.)  It  is  therefore  natural  to  regard 
collections of such products as "mixture" populations, the
component populations of which correspond to the distinct causes
of  failure.  The  first  objective  of  life  testing  in  such  cases  is 
likely  to  be  estimation  of  the  proportions  and  other  statistical 
parameters  associated  with  the  failure  component  populations. 

Because  of  restrictions  on  time  available  for  testing,  life 
testing  experiments  must  often  be  concluded  after  a predetermined 
length of time has elapsed or after a predetermined number of
product  units  have  failed,  resulting  in  censored  sampling.  If 
the  causes  of  failure  of  the  failed  products  are  determined  in 
the  course  of  such  an  experiment,  then  the  (labeled)  failed 
products  together  with  those  (unlabeled)  products  which  did  not 
fail  constitute  a sample  of  Type  4. 

The likelihood function of a sample of observations is the
probability density function of the random sample evaluated at
the observations at hand. When maximum-likelihood estimates are
of interest, it is usually convenient to deal with the logarithm
of the likelihood function, called the log-likelihood function,
rather than with the likelihood function itself. The following
are the log-likelihood functions $L_1$, $L_2$, $L_3$ and $L_4$ of
samples $S_1$, $S_2$, $S_3$ and $S_4$ of Types 1, 2, 3 and 4,
respectively:

\[ L_1(\Phi) = \sum_{k=1}^{N} \log p(x_k|\Phi) \qquad (3.1) \]

\[ L_2(\Phi) = \sum_{i=1}^{m} \sum_{k=1}^{J_i} \log p_i(y_{ik}|\phi_i) \qquad (3.2) \]

\[ L_3(\Phi) = \sum_{i=1}^{m} \sum_{k=1}^{K_i} \log [\alpha_i p_i(z_{ik}|\phi_i)] + \log \frac{K!}{K_1! \cdots K_m!} \qquad (3.3) \]

\[ L_4(\Phi) = \sum_{k=1}^{M_0} \log p(w_{0k}|\Phi) + \sum_{i=1}^{m} \sum_{k=1}^{M_i} \log [\alpha_i p_i(w_{ik}|\phi_i)] + \log \frac{M!}{M_0! M_1! \cdots M_m!} \qquad (3.4) \]

Note that if a sample of observations is a union of independent
samples of the types considered here, then the log-likelihood
function of the sample is just the corresponding sum of
log-likelihood functions defined above for the samples in the union.
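
The log-likelihood functions (3.1) through (3.3) can be written out directly for univariate normal component densities. The sketch below is illustrative: labeled samples are represented as a list of lists indexed by component, which is a data layout assumed here rather than one taken from the text, and the function names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def loglik_type1(xs, alphas, mus, sigmas):
    """(3.1): unlabeled observations on the mixture."""
    return sum(
        math.log(sum(a * normal_pdf(x, m, s)
                     for a, m, s in zip(alphas, mus, sigmas)))
        for x in xs
    )


def loglik_type2(ys, mus, sigmas):
    """(3.2): ys[i] holds the observations drawn from component i."""
    return sum(
        math.log(normal_pdf(y, mus[i], sigmas[i]))
        for i, yi in enumerate(ys) for y in yi
    )


def loglik_type3(zs, alphas, mus, sigmas):
    """(3.3): zs[i] holds the mixture observations labeled as component i."""
    counts = [len(zi) for zi in zs]
    # Logarithm of the multinomial coefficient K! / (K_1! ... K_m!).
    log_coef = math.lgamma(sum(counts) + 1) - sum(
        math.lgamma(c + 1) for c in counts)
    body = sum(
        math.log(alphas[i] * normal_pdf(z, mus[i], sigmas[i]))
        for i, zi in enumerate(zs) for z in zi
    )
    return body + log_coef
```

For the same labeled data, `loglik_type3` exceeds `loglik_type2` by the sum of $K_i \log \alpha_i$ plus the multinomial term, which is how the counts $K_i$ carry information about the proportions.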


If $S$ is a sample of observations of the sort under
consideration, then by a maximum-likelihood estimate of $\Phi^*$, we
mean any choice of $\Phi$ in $\Omega$ at which the log-likelihood
function of $S$, denoted by $L(\Phi)$, attains its largest local
maximum in $\Omega$. In defining a maximum-likelihood estimate in
this way, we have taken into account two practical difficulties
associated with maximum-likelihood estimation for mixture
densities.

The first difficulty is that one cannot always in good
conscience take $\Omega$ to be a set in which the log-likelihood
function is bounded above, and so there are not always points in
$\Omega$ at which $L$ attains a global maximum over $\Omega$. Perhaps
the most notorious mixture problem for which $L$ is not bounded
above in $\Omega$ is that in which $p$ is a mixture of normal
densities and $S = S_1$, a sample of Type 1. It is easily seen in
this case that if one of the mixture means coincides with a sample
observation and if the corresponding variance tends to zero (or if
the corresponding covariance matrix tends in certain ways to a
singular matrix in the multivariate case), then the
log-likelihood function increases without bound. For the normal
mixture problem, an advantage of including labeled observations
in a sample is that with probability one, this difficulty does
not occur if the sample includes more than $n$ labeled
observations from each component population. This was observed
in the univariate case by Hosmer [68].
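
The unboundedness is easy to observe numerically. In the sketch below, one component of a two-component univariate normal mixture is centered on the first observation while its variance is shrunk; the data values, the fixed second component, and the function names are arbitrary choices made for illustration.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def loglik_spike(xs, var_spike, mu2=1.0, var2=1.0, alpha=0.5):
    """Type 1 log-likelihood with the first component mean set to xs[0]."""
    mu1 = xs[0]
    return sum(
        math.log(alpha * normal_pdf(x, mu1, var_spike)
                 + (1.0 - alpha) * normal_pdf(x, mu2, var2))
        for x in xs
    )


if __name__ == "__main__":
    data = [0.0, -1.2, 0.7, 1.5, 2.1]
    for v in (1e-2, 1e-4, 1e-8):
        # The printed log-likelihood grows as the variance shrinks.
        print(v, loglik_spike(data, v))
```

The term for the observation at the spike grows like $-\frac{1}{2} \log$ of the variance, while the second component keeps every other term bounded below.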


The second difficulty is that mixture problems are very
often such that the log-likelihood function attains its largest
local maximum at several different choices of $\Phi$. Indeed, if
$p_i$ and $p_j$ are of the same parametric family for some $i$ and
$j$ and if $S = S_1$, a sample of Type 1, then the value of $L(\Phi)$
will not change if the component pairs $(\alpha_i, \phi_i)$ and
$(\alpha_j, \phi_j)$ are interchanged in $\Phi$, i.e., if in effect
there is "label switching" of the $i$th and $j$th component
populations. The results reviewed below show that whether or not
such "label switching" is a cause for concern depends on whether
estimates of the particular component density parameters are of
interest or whether only an approximation of the mixture density
is desired. We remark that this "label switching" difficulty can
certainly occur in mixtures which are identifiable (see Section
2.5).
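
The invariance under "label switching" can be checked directly for a univariate normal mixture: interchanging two (proportion, mean, standard deviation) triples leaves the Type 1 log-likelihood unchanged. The helper names and the data below are illustrative assumptions.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def loglik_type1(xs, alphas, mus, sigmas):
    """Log-likelihood of an unlabeled sample under the mixture."""
    return sum(
        math.log(sum(a * normal_pdf(x, m, s)
                     for a, m, s in zip(alphas, mus, sigmas)))
        for x in xs
    )


def swap_components(alphas, mus, sigmas, i, j):
    """Return copies of the parameter lists with components i and j exchanged."""
    alphas, mus, sigmas = list(alphas), list(mus), list(sigmas)
    for params in (alphas, mus, sigmas):
        params[i], params[j] = params[j], params[i]
    return alphas, mus, sigmas
```

Two parameter vectors related in this way describe the same mixture density, which is why identifiability is defined only up to a permutation of the components.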

In  the  remainder  of  this  section,  our  interest  is  in  the 
important  general  qualitative  properties  of  maximum-likelihood 
estimates  of  mixture  density  parameters.  For  convenience,  we 
restrict  the  discussion  to  the  case  which  is  most  often  addressed 
in  the  literature,  namely  that  in  which  the  sample  S at  hand  is 
a sample of Type 1. We also assume that each component
density $p_i$ is differentiable with respect to $\phi_i$ and make the
nonessential assumption that the parameters $\phi_i$ are unconstrained
in $\Omega_i$ and mutually independent variables. It is not difficult
to modify the discussion below to obtain similar statements which
are appropriate for other mixture density estimation problems of
interest. For a discussion of the properties of
maximum-likelihood estimates of constrained variables, see the paper of
Aitchison  and  Silvey  [2]. 

The traditional general approach to determining a
maximum-likelihood estimate is first to arrive at a system of
likelihood equations satisfied by the maximum-likelihood estimate
and then to try to obtain a maximum-likelihood estimate by solving
the likelihood equations. Basically, the likelihood equations are
found by considering the partial derivatives of the
log-likelihood function with respect to the components of $\Phi$. If
$\hat\Phi = (\hat\alpha_1, \ldots, \hat\alpha_m, \hat\phi_1, \ldots, \hat\phi_m)$
is a maximum-likelihood estimate, then one has the likelihood
equations

\[ \nabla_{\phi_i} L(\hat\Phi) = 0 , \quad i = 1, \ldots, m , \qquad (3.5) \]

determined by the unconstrained parameters $\phi_i$,
$i = 1, \ldots, m$. (Our convention is that "$\nabla$" with a
variable appearing as a subscript indicates the gradient of first
partial derivatives with respect to the components of the
variable.)

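To obtain likelihood equations determined by the
proportions, which are constrained to be non-negative and to sum
to one, we follow Peters and Walker [102]. Setting
$\hat\alpha = (\hat\alpha_1, \ldots, \hat\alpha_m)$, one sees that

\[ 0 \geq \nabla_\alpha L(\hat\Phi)^T (\alpha - \hat\alpha) \qquad (3.6) \]

for all $\alpha = (\alpha_1, \ldots, \alpha_m)$ such that
$\sum_{i=1}^{m} \alpha_i = 1$ and $\alpha_i \geq 0$,
$i = 1, \ldots, m$. Now (3.6) holds for all $\alpha$ satisfying the
given constraints if and only if

\[ 0 \geq \nabla_\alpha L(\hat\Phi)^T (e_i - \hat\alpha) , \quad i = 1, \ldots, m , \]

with equality for those values of $i$ for which $\hat\alpha_i > 0$.
(Here, $e_i$ is the vector the $i$th component of which is one and
the other components of which are zero.) It follows that (3.6) is
equivalent to

\[ 1 \geq \frac{1}{N} \sum_{k=1}^{N} \frac{p_i(x_k|\hat\phi_i)}{p(x_k|\hat\Phi)} , \quad i = 1, \ldots, m , \qquad (3.7) \]

with equality for those values of $i$ for which $\hat\alpha_i > 0$.
Finally, multiplying each side of (3.7) by $\hat\alpha_i$ for
$i = 1, \ldots, m$ yields likelihood equations in the convenient
form

\[ \hat\alpha_i = \frac{1}{N} \sum_{k=1}^{N} \frac{\hat\alpha_i p_i(x_k|\hat\phi_i)}{p(x_k|\hat\Phi)} , \quad i = 1, \ldots, m . \qquad (3.8) \]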


We remark that it is easily seen by considering the matrix
of second partial derivatives of $L$ with respect to
$\alpha_1, \ldots, \alpha_m$ that $L$ is a concave function of
$\alpha = (\alpha_1, \ldots, \alpha_m)$ for any fixed set of values
$\phi_i \in \Omega_i$, $i = 1, \ldots, m$. Thus, for any fixed
$\phi_i$, $i = 1, \ldots, m$, (3.6) and, hence, (3.7) are sufficient
as well as necessary for $\hat\alpha$ to maximize $L$ over the set of
all $\alpha$ satisfying the given constraints. On the other hand,
the likelihood equations (3.8) are necessary but not sufficient
conditions for $\hat\alpha$ to maximize $L$ for fixed $\phi_i$,
$i = 1, \ldots, m$. Indeed, $\hat\alpha = e_i$ satisfies (3.8) for
$i = 1, \ldots, m$. In fact, it follows from the concavity of $L$
that there is a solution of (3.8) in each (closed) face of the
simplex of points $\alpha$ satisfying the given constraints. In
spite of perhaps suffering from a surplus of solutions, the
likelihood equations (3.8) nevertheless have a useful form which
takes on additional significance later in the context of the EM
algorithm.
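
The contrast between (3.7) and (3.8) can be seen numerically by treating the right-hand side of (3.8) as a map on the proportions with the component parameters held fixed. In the sketch below, which assumes two univariate normal components with known means and standard deviations and uses hypothetical function names, a vertex of the simplex is left unchanged by the map even though it violates (3.7), while iteration from an interior point settles where (3.7) holds with equality.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def proportion_update(xs, alphas, mus, sigmas):
    """Right-hand side of (3.8) for fixed component parameters."""
    new = [0.0] * len(alphas)
    for x in xs:
        dens = [normal_pdf(x, m, s) for m, s in zip(mus, sigmas)]
        p = sum(a * d for a, d in zip(alphas, dens))
        for i, (a, d) in enumerate(zip(alphas, dens)):
            new[i] += a * d / p
    return [v / len(xs) for v in new]


def gradient_ratios(xs, alphas, mus, sigmas):
    """Right-hand side of (3.7): (1/N) * sum over k of p_i(x_k) / p(x_k)."""
    out = [0.0] * len(alphas)
    for x in xs:
        dens = [normal_pdf(x, m, s) for m, s in zip(mus, sigmas)]
        p = sum(a * d for a, d in zip(alphas, dens))
        for i, d in enumerate(dens):
            out[i] += d / p
    return [v / len(xs) for v in out]


def draw_sample(n, alpha, mus, sigmas, seed=0):
    """Unlabeled sample from a two-component normal mixture."""
    rng = random.Random(seed)
    return [rng.gauss(mus[0], sigmas[0]) if rng.random() < alpha
            else rng.gauss(mus[1], sigmas[1]) for _ in range(n)]
```

A value of `gradient_ratios` above one at a vertex signals that moving weight toward that component would increase the log-likelihood, so the vertex solves (3.8) without being a maximizer.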

The equations (3.5) and (3.8) together constitute a full set
of likelihood equations which are necessary but not sufficient
conditions for a maximum-likelihood estimate. Of course, some
irrelevant solutions of the likelihood equations can be avoided
in practice by using one of a number of procedures for obtaining
a numerical solution of them (among which is the EM algorithm)
which in all but the most unfortunate circumstances will yield a
local maximizer of the log-likelihood function (or a singularity
near which it grows without bound) rather than some stationary
point which is a local minimizer or a saddle point. Still, it is
natural to ask at this point the extent to which solving the
likelihood equations can be expected to produce a
maximum-likelihood estimate and the extent to which a
maximum-likelihood estimate can be expected to be a good
approximation of $\Phi^*$.

Two general theorems are offered below which give a fair
summary of the results in the literature most pertinent to the
question put forth above. As a convenience, we assume that
$\alpha_i^* > 0$ for $i = 1, \ldots, m$. For the purposes of the
theorems and the discussion following them, this justifies
writing, say, $\alpha_m = 1 - \sum_{i=1}^{m-1} \alpha_i$ and
considering the redefined, locally unconstrained variable
$\Phi = (\alpha_1, \ldots, \alpha_{m-1}, \phi_1, \ldots, \phi_m)$ in
the correspondingly modified set.

The likelihood equations (3.5) and (3.8) can now be written in
the general unconstrained form

\[ \nabla_\Phi L(\Phi) = 0 , \qquad (3.9) \]

which facilitates our presenting the theorems as general results
which are not restricted to the mixture problem at hand or, for
that matter, to mixture problems at all. In our discussion of
the theorems, all statements regarding measure and integration
are made with respect to the underlying measure on $R^n$
appropriate for $p(x|\Phi)$, which we denote by $\mu$.

The first theorem states roughly that under reasonable
assumptions, there is a unique strongly consistent solution of
the likelihood equations (3.9) and this solution at least locally
maximizes the log-likelihood function and is asymptotically
normally distributed. (Consistent in the usual sense means
converging with probability approaching 1 to the true parameters
as the sample size approaches infinity; strongly consistent means
having the same limit with probability 1.) This theorem is a
compendium of results generalizing the initial work of Cramér
[35] concerning existence, consistency, and asymptotic normality
of the maximum-likelihood estimate of a single scalar parameter.
The conditions below, on which the theorem rests, were
essentially given by Chanda [21] as multi-dimensional
generalizations of those of Cramér. With them, Chanda claimed
that there exists a unique solution of the likelihood equations
which is consistent in the usual sense (this fact was correctly
proved by Tarone and Gruenhage [122]) and established its
asymptotic normal behavior. (See also the summary in Kiefer
[77], the discussion in Zacks [71], and the related material for
constrained maximum-likelihood estimates in Aitchison and Silvey
[2].) Using these same conditions, Peters and Walker [101,
Appendix A] showed that there is a unique strongly consistent
solution of the likelihood equations and that it at least locally
maximizes the log-likelihood function.

In stating the following conditions and in the discussion
after the theorem, it is convenient to adopt temporarily the
notation $\Phi = (\xi_1, \ldots, \xi_\nu)$, where
$\nu = m - 1 + \sum_{i=1}^{m} n_i$ and $\xi_i \in \mathbb{R}^1$
for $i = 1, \ldots, \nu$. Also, we remark that because the
results of the theorem below implied by these conditions are
strictly local in nature, there is no loss of generality in
restricting $\Omega$ to be any neighborhood of $\Phi^*$ if such a
restriction is necessary for the first condition to be met.

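Condition 1. For all $\Phi \in \Omega$, for almost all
$x \in \mathbb{R}^n$, and for $i, j, k = 1, \ldots, \nu$, the
partial derivatives $\partial p/\partial \xi_i$,
$\partial^2 p/\partial \xi_i \partial \xi_j$, and
$\partial^3 p/\partial \xi_i \partial \xi_j \partial \xi_k$
exist and satisfy

$$\left|\frac{\partial p(x|\Phi)}{\partial \xi_i}\right| \le f_i(x), \quad
\left|\frac{\partial^2 p(x|\Phi)}{\partial \xi_i \partial \xi_j}\right| \le f_{ij}(x), \quad
\left|\frac{\partial^3 \log p(x|\Phi)}{\partial \xi_i \partial \xi_j \partial \xi_k}\right| \le f_{ijk}(x),$$

where $f_i$ and $f_{ij}$ are integrable and $f_{ijk}$ satisfies

$$\int_{\mathbb{R}^n} f_{ijk}(x)\, p(x|\Phi^*)\, d\mu < \infty.$$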


Condition 2. The Fisher information matrix $I(\Phi)$ given by
(2.5.1) is well-defined and positive definite at $\Phi^*$.

Theorem 3.1. If Conditions 1 and 2 are satisfied and any
sufficiently small neighborhood of $\Phi^*$ in $\Omega$ is given,
then with probability 1, there is for sufficiently large $N$ a
unique solution $\Phi^N$ of the likelihood equations (3.9) in that
neighborhood and this solution locally maximizes the log-
likelihood function. Furthermore, $\sqrt{N}(\Phi^N - \Phi^*)$ is
asymptotically normally distributed with mean zero and covariance
matrix $I(\Phi^*)^{-1}$.


The second theorem is directed toward two questions left
unresolved by the theorem above regarding $\Phi^N$, the unique
strongly consistent solution of the likelihood equations. The
first question is whether $\Phi^N$ is really a maximum-likelihood
estimate, i.e., a point at which the log-likelihood function
attains its largest local maximum. The second is whether, even
if the answer to the first question is "yes", there are maximum-
likelihood estimates other than $\Phi^N$ which lead to limiting
densities other than $p(x|\Phi^*)$. Given our assumption of
identifiability of the family of mixture densities $p(x|\Phi)$,
$\Phi \in \Omega$, one easily sees that the theorem below implies
that if $\Omega'$ is any compact subset of $\Omega$ which contains
$\Phi^*$ in its interior, then with probability 1, $\Phi^N$ is a
maximum-likelihood estimate in $\Omega'$ for sufficiently large
$N$. Furthermore, every other maximum-likelihood estimate in
$\Omega'$ is obtained from $\Phi^N$ by the "label switching"
described earlier and, hence, leads to the same limiting density
$p(x|\Phi^*)$. Accordingly, we usually assume in the sequel that
Conditions 1 through 4 are satisfied and refer to $\Phi^N$ as the
unique strongly consistent maximum-likelihood estimate. The
theorem is a slightly restricted version of a general result of
Redner [110] which extends earlier work by Wald [129] on the
consistency of maximum-likelihood estimates. It should be
remarked that the result of [110] rests on somewhat weaker
assumptions than those made here and is specifically aimed at
families of distributions which are not identifiable.

For $\Phi \in \Omega$ and sufficiently small $r > 0$, let
$N_r(\Phi)$ denote the closed ball of radius $r$ about $\Phi$ in
$\Omega$ and define

$$p(x|\Phi, r) = \sup_{\Phi' \in N_r(\Phi)} p(x|\Phi')$$

and

$$p^*(x|\Phi, r) = \max\{1, p(x|\Phi, r)\}.$$

Condition 3. For each $\Phi \in \Omega$ and sufficiently small
$r > 0$,

$$\int_{\mathbb{R}^n} \log p^*(x|\Phi, r)\, p(x|\Phi^*)\, d\mu < \infty.$$

Condition 4.

$$\int_{\mathbb{R}^n} \log p(x|\Phi^*)\, p(x|\Phi^*)\, d\mu < \infty.$$


Theorem 3.2. Let $\Omega'$ be any compact subset of $\Omega$
which contains $\Phi^*$ in its interior, and set

$$C = \{\Phi \in \Omega' : p(x|\Phi) = p(x|\Phi^*) \text{ almost everywhere}\}.$$

If Conditions 3 and 4 are satisfied and $D$ is any closed subset
of $\Omega'$ not intersecting $C$, then with probability 1,

$$\lim_{N \to \infty} \sup_{\Phi \in D}
\frac{\prod_{k=1}^{N} p(x_k|\Phi)}{\prod_{k=1}^{N} p(x_k|\Phi^*)} = 0.$$


From a theoretical point of view, Theorems 3.1 and 3.2 are
adequate for mixture density estimation problems in providing
assurance of the existence of strongly consistent maximum-
likelihood estimates, characterizing them as solutions of the
likelihood equations, and prescribing their asymptotic behavior.
In practice, however, one must still contend with certain
potential mathematical, statistical, and even numerical
difficulties associated with maximum-likelihood estimates. Some
possible mathematical problems have been suggested above: The
log-likelihood function may have many local and global maxima and
perhaps even singularities; furthermore, the likelihood equations
are likely to have solutions which are not local maxima of the
log-likelihood function. According to Theorem 3.1, the
statistical soundness (as measured by bias and variance) of the
strongly consistent maximum-likelihood estimate is determined, at
least for large samples, by the Fisher information matrix
$I(\Phi^*)$. As it happens, $I(\Phi^*)$ also plays a role in
determining the numerical well-posedness of the problem of
approximating the strongly consistent maximum-likelihood estimate
for large samples.

To show how $I(\Phi^*)$ enters into the problem of numerically
approximating $\Phi^N$ for large samples, we recall that the
condition of a problem is reflected by the relative sensitivity
of its solution to perturbations in the data associated with the
problem. For an optimization problem, the condition is
customarily measured by the condition number of the Hessian
matrix of the function to be optimized evaluated at the solution.
(For the definition and properties of the condition number of a
matrix, see, for example, Stewart [118].) For the log-likelihood
function at hand, the Hessian matrix, which we denote by
$H(\Phi)$, is given by

$$H(\Phi) = \sum_{k=1}^{N} \nabla_\Phi \nabla_\Phi^T \log p(x_k|\Phi), \tag{3.10}$$

where $\nabla_\Phi^T = (\partial/\partial \xi_1, \ldots, \partial/\partial \xi_\nu)$.
If Conditions 1 and 2 above are satisfied, then it follows from
the Strong Law of Large Numbers (see Loève [82]) that with
probability 1,

$$\lim_{N \to \infty} \frac{1}{N} H(\Phi^N) = -I(\Phi^*). \tag{3.11}$$

Since $\frac{1}{N} H(\Phi^N)$ has the same condition number as
$H(\Phi^N)$, (3.11) is the desired result.

To illustrate the potential severity of the statistical and
numerical problems associated with maximum-likelihood estimates,
we augment the material on the Fisher information matrix in the
literature cited in Section 2.5 with Table 3.3 below, which lists
approximate values of the condition number and the diagonal
elements of the inverse of $I(\Phi^*)$ for a mixture of two
univariate normal densities (see (1.3) and (1.4)) at a variety of
choices of $\Phi^*$. To prepare this table, we took
$\Phi = (\alpha_1, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$ and
numerically evaluated $I(\Phi^*)$, its condition number, and its
inverse for selected values of $\Phi^*$ using IMSL Library
routines DCADRE, EIGRS, and LINV2P on a CDC7600. The choices of
$\Phi^*$ were obtained by taking $\alpha_1^* = .3$ and
$\sigma_1^{2*} = \sigma_2^{2*} = 1$ and varying the mean
separation $\mu_1^* - \mu_2^*$. In the table, the condition
number of $I(\Phi^*)$ is denoted by $\kappa$, and the first
through fifth diagonal elements of $I(\Phi^*)^{-1}$ are denoted by
$I^{-1}(\alpha_1)$, $I^{-1}(\mu_1)$, $I^{-1}(\mu_2)$,
$I^{-1}(\sigma_1^2)$, and $I^{-1}(\sigma_2^2)$, respectively.
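
The computation described above can be approximated today with a few lines of NumPy. The following is only a sketch under stated assumptions: it replaces the IMSL routines with a trapezoidal quadrature on a fixed grid, it takes $\Phi = (\alpha_1, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$ as the parameter vector, and the function names (`fisher_information`, `condition_and_inverse_diagonal`) are illustrative rather than taken from the text.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def fisher_information(alpha1, mu1, mu2, var1, var2, half_width=12.0, num=40001):
    """Approximate I(Phi) for Phi = (alpha1, mu1, mu2, var1, var2) by quadrature."""
    spread = half_width * np.sqrt(max(var1, var2))
    x = np.linspace(min(mu1, mu2) - spread, max(mu1, mu2) + spread, num)
    n1, n2 = normal_pdf(x, mu1, var1), normal_pdf(x, mu2, var2)
    p = alpha1 * n1 + (1.0 - alpha1) * n2
    # Rows: partial derivatives of the mixture density p with respect to
    # alpha1, mu1, mu2, var1, var2 (in that order).
    dp = np.stack([
        n1 - n2,
        alpha1 * n1 * (x - mu1) / var1,
        (1.0 - alpha1) * n2 * (x - mu2) / var2,
        alpha1 * n1 * ((x - mu1) ** 2 - var1) / (2.0 * var1 ** 2),
        (1.0 - alpha1) * n2 * ((x - mu2) ** 2 - var2) / (2.0 * var2 ** 2),
    ])
    # I_ij = integral of (dp_i * dp_j / p) dx, with trapezoidal weights.
    weights = np.full(num, x[1] - x[0])
    weights[[0, -1]] *= 0.5
    return np.einsum("ik,jk,k->ij", dp, dp, weights / p)

def condition_and_inverse_diagonal(separation, alpha1=0.3, var=1.0):
    """Condition number of I(Phi*) and the diagonal of its inverse."""
    info = fisher_information(alpha1, separation, 0.0, var, var)
    return np.linalg.cond(info), np.diag(np.linalg.inv(info))

kappa_close, inv_diag_close = condition_and_inverse_diagonal(1.0)
kappa_far, inv_diag_far = condition_and_inverse_diagonal(3.0)
print(kappa_close, kappa_far)
```

The quadrature grid and the 2-norm condition number returned by `np.linalg.cond` are choices made for this sketch, so the printed values should be compared with Table 3.3 only as rough checks.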


2.  We  are  grateful  to  the  Mathematics  and  Statistics  Division  of 
the  Lawrence  Livermore  National  Laboratory  for  allowing  us  to 
use  their  computing  facility  in  generating  this  table. 


$\mu_1^*-\mu_2^*$   $\kappa$      $I^{-1}(\alpha_1)$  $I^{-1}(\mu_1)$  $I^{-1}(\mu_2)$  $I^{-1}(\sigma_1^2)$  $I^{-1}(\sigma_2^2)$

[illegible]         3.06x10^10    4.39x10^10          4.86x10^9        8.98x10^8        2.15x10^7             4.02x10^6
[illegible]         ?.05x10^6     5.54x10^6           3.81x10^6        7.17x10^5        1.04x10^5             2.07x10^4
1.0                 5.18x10^4     8.59x10^3           2.32x10^4        4.55x10^3        2.58x10^3             578.
[illegible]         4.80x10^3     237.                1.43x10^3        290.             383.                  95.0
2.0                 1.10x10^3     20.4                216.             45.8             115.                  31.3
3.0                 187.          .874                18.9             4.81             2?.2                  8.83
[illegible]         71.7          .267                5.72             1.95             13.4                  4.71
[illegible]         35.7          .211                3.44             1.45             7.47                  3.06

Table 3.3: Condition number and diagonal elements of the inverse
of $I(\Phi^*)$ for a mixture of two univariate normal densities with
$\alpha_1^* = .3$ and $\sigma_1^{2*} = \sigma_2^{2*} = 1$.

Table 3.3 reinforces one's intuitive understanding that for
mixture density estimation problems, maximum-likelihood estimates
are more appealing from both a statistical and a numerical
standpoint if the component densities in the mixture are well
separated than if they are poorly separated. Perhaps the most
troublesome implication of Table 3.3 is that if the component
densities are poorly separated, then impractically large sample
sizes might be required in order to expect even moderately
precise maximum-likelihood estimates. For example, Table 3.3
indicates that if one considers data from a mixture of two
univariate normal densities with $\alpha_1^* = .3$,
$\sigma_1^{2*} = \sigma_2^{2*} = 1$, and $\mu_1^* - \mu_2^* = 1$,
then a sample size on the order of $10^6$ is necessary to insure
that the standard deviation of each component of the maximum-
likelihood estimate is about 0.1 or less. Even if a sample of
such horrendous size were available, the fact that evaluating the
log-likelihood function and associated functions such as its
derivatives involves summation over observations in the sample,
considered together with the condition number of
$5.18 \times 10^4$ for the information matrix, suggests that
computing undertaken in seeking a maximum-likelihood estimate
should be carried out with great care.
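
The sample-size figure quoted above follows from Theorem 3.1: for large $N$ the variance of the $i$-th component of the estimate is approximately the $i$-th diagonal element of $I(\Phi^*)^{-1}$ divided by $N$. The short sketch below redoes that arithmetic with the Table 3.3 entries for mean separation 1.0; the helper name `required_sample_size` is illustrative only.

```python
import math

# Diagonal elements of the inverse information matrix, read from the
# Table 3.3 row for mean separation 1.0.
inv_info_diag = [8.59e3, 2.32e4, 4.55e3, 2.58e3, 578.0]
target_sd = 0.1

def required_sample_size(inv_info_diag, target_sd):
    """Smallest N with sqrt(d / N) <= target_sd for every diagonal entry d."""
    return math.ceil(max(inv_info_diag) / target_sd ** 2)

n_required = required_sample_size(inv_info_diag, target_sd)
print(n_required)
```

The largest diagonal entry ($2.32 \times 10^4$) drives the requirement, which gives a sample size of roughly $2.3 \times 10^6$, in line with the order of magnitude stated in the text.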

Similar observations regarding the asymptotic dependence of
the accuracy of maximum-likelihood estimates on sample sizes and
separation of the component populations have been made by a
number of authors (Mendenhall and Hader [89], Hill [66],
Hasselblad [64], [65], Day [36], Tan and Chang [121], Dick and
Bowden [42], Hosmer [67], [68], Hosmer and Dick [71]). Several
of them (Mendenhall and Hader [89], Day [36], Hasselblad [65],
Dick and Bowden [42], Hosmer [67]) also suggested that things are
worse for small samples (less than a few hundred observations)
than the asymptotic theory indicates. Hosmer [67] specifically
addressed the small-sample, poor-separation case for a mixture of
two univariate normals and concluded that in this case maximum-
likelihood estimates "should be used with extreme caution or not
at all." Dick and Bowden [42], Hosmer [68], and Hosmer and Dick
[71] offered evidence which suggests that considerable
improvement in the performance of maximum-likelihood estimates
can result from including labeled observations in the samples by
which the estimates are determined, particularly when the
component densities are poorly separated. In fact, it is pointed
out in [71] that most of the improvement occurs for small to
moderate proportions of labeled observations in the sample.

In  spite  of  the  rather  pessimistic  comments  above,  maximum- 
likelihood  estimates  have  fared  well  in  comparisons  with  most 
other  estimates  for  mixture  density  estimation  problems.  Day 
[36],  Hasselblad  [65],  Tan  and  Chang  [121],  and  Dick  and  Bowden 
[42]  found  maximum-likelihood  estimates  to  be  markedly  superior 
to  moment  estimates  in  their  investigations,  especially  in  cases 
involving  poorly  separated  component  populations.  (See  also  the 
comment  by  Hosmer  [70]  on  the  paper  of  Quandt  and  Ramsey  [106].) 
Day  [36]  also  remarked  that  minimum  chi-square  and  Bayes 
estimates have less appeal than maximum-likelihood estimates,
primarily because of the difficulty of obtaining them in most
cases. James [73] and Ganesalingam and McLachlan [49] observed
that  their  proportion  estimates  are  less  efficient  than  maximum- 
likelihood  estimates;  however,  they  also  outlined  circumstances 
in  which  their  estimates  might  be  preferred.  On  the  other  hand, 
as  we  remarked  in  Sec.  2.3,  the  moment  generating  function  method 
of  Quandt  and  Ramsey  [106]  provides  estimates  which  may 
outperform  maximum-likelihood  estimates  in  the  small-sample  case 
(see  the  comment  by  Hosmer  [70]).  This  method  should  be  kept  in 
mind  as  a promising  alternative  to  the  method  of  maximum 
likelihood. 


4. The EM Algorithm.

We now derive the EM algorithm for general mixture density
estimation problems and discuss its important general properties.
As stated in the introduction, we feel that the EM algorithm for
mixture density estimation problems is best regarded as a
specialization of the general EM algorithm formalized by
Dempster, Laird and Rubin [38] for obtaining maximum-likelihood
estimates from incomplete data. Accordingly, we begin by
reviewing the formulation of the general EM algorithm given in
[38].




Suppose that one has a measure space $\mathcal{Y}$ of "complete
data" and a measurable map $y \to x(y)$ of $\mathcal{Y}$ to a
measure space $\mathcal{X}$ of "incomplete data". Let $f(y|\Phi)$
be a member of a parametric family of probability density
functions defined on $\mathcal{Y}$ for $\Phi \in \Omega$, and
suppose that $g(x|\Phi)$ is a probability density function on
$\mathcal{X}$ induced by $f(y|\Phi)$. For a given
$x \in \mathcal{X}$, the purpose of the EM algorithm is to
maximize the incomplete data log-likelihood
$L(\Phi) = \log g(x|\Phi)$ over $\Phi \in \Omega$ by exploiting
the relationship between $f(y|\Phi)$ and $g(x|\Phi)$. It is
intended especially for applications in which the maximization of
the complete data log-likelihood $\log f(y|\Phi)$ over
$\Phi \in \Omega$ is particularly easy.


For $x \in \mathcal{X}$, set
$\mathcal{Y}(x) = \{y \in \mathcal{Y} : x(y) = x\}$. The
conditional density $k(y|x,\Phi)$ on $\mathcal{Y}(x)$ is given by
$f(y|\Phi) = k(y|x,\Phi)\, g(x|\Phi)$. For $\Phi$ and $\Phi'$ in
$\Omega$, one then has

$$L(\Phi) = Q(\Phi|\Phi') - H(\Phi|\Phi'),$$

where $Q(\Phi|\Phi') = E(\log f(y|\Phi) \mid x, \Phi')$ and
$H(\Phi|\Phi') = E(\log k(y|x,\Phi) \mid x, \Phi')$. The general
EM algorithm of Dempster, Laird and Rubin [38] is the following:
Given a current approximation $\Phi^c$ of a maximizer of
$L(\Phi)$, obtain a next approximation $\Phi^+$ as follows:

1. E-step: Determine $Q(\Phi|\Phi^c)$.

2. M-step: Choose $\Phi^+ \in \arg\max_{\Phi \in \Omega} Q(\Phi|\Phi^c)$.

Here, $\arg\max_{\Phi \in \Omega} Q(\Phi|\Phi^c)$ denotes the set
of values $\Phi \in \Omega$ which maximize $Q(\Phi|\Phi^c)$ over
$\Omega$. (Of course, this set must be nonempty for the M-step of
the algorithm to be well-defined.) If this set is a singleton,
then we denote its sole member in the same way and write
$\Phi^+ = \arg\max_{\Phi \in \Omega} Q(\Phi|\Phi^c)$. Similar
notation is used without further explanation in the sequel.

From this general description, it is not clear that the EM
algorithm even deserves to be called an algorithm. However, as
we indicated above, the EM algorithm is used most often in
applications which permit the easy maximization of
$\log f(y|\Phi)$ over $\Phi \in \Omega$. In such applications,
the M-step maximization of $Q(\Phi|\Phi')$ over $\Phi \in \Omega$
is usually carried out with corresponding ease. In fact, as one
sees in the sequel, the E-step and the M-step are usually
combined into one very easily implemented step in most
applications involving mixture density estimation problems. At
any rate, the sense of the EM algorithm lies in the fact that
$L(\Phi^+) \ge L(\Phi^c)$. Indeed, the manner in which $\Phi^+$
is determined guarantees that
$Q(\Phi^+|\Phi^c) \ge Q(\Phi^c|\Phi^c)$; and it follows from
Jensen's Inequality that $H(\Phi^+|\Phi^c) \le H(\Phi^c|\Phi^c)$.
(See Theorem 1 of Dempster, Laird and Rubin [38].) This fact
implies that $L$ is monotone increasing on any iteration sequence
generated by the EM algorithm, which is the fundamental property
of the algorithm underlying the convergence theorems given below.
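
The monotonicity property can be observed numerically. The sketch below is a hypothetical illustration, not a construction from the text: `run_em` is a generic driver that takes the combined E- and M-step as a callable, and the toy problem assumes a mixture of two fully specified unit-variance normal densities in which only the mixing proportion is unknown, so that the EM update is the average posterior probability of the first component.

```python
import numpy as np

def run_em(phi0, em_step, log_likelihood, max_iter=200, tol=1e-10):
    """Iterate phi <- em_step(phi), recording L(phi) at every iterate."""
    phi = phi0
    history = [log_likelihood(phi)]
    for _ in range(max_iter):
        phi = em_step(phi)
        history.append(log_likelihood(phi))
        if history[-1] - history[-2] < tol:
            break
    return phi, history

# Toy data: assumed mixture 0.3 * N(0, 1) + 0.7 * N(2, 1).
rng = np.random.default_rng(0)
from_first = rng.random(500) < 0.3
x = np.where(from_first, rng.normal(0.0, 1.0, 500), rng.normal(2.0, 1.0, 500))

def unit_normal_pdf(x, mean):
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)

d1, d2 = unit_normal_pdf(x, 0.0), unit_normal_pdf(x, 2.0)

def log_likelihood(alpha):
    return float(np.sum(np.log(alpha * d1 + (1.0 - alpha) * d2)))

def em_step(alpha):
    # Posterior probability that each observation came from component 1,
    # averaged over the sample.
    weights = alpha * d1 / (alpha * d1 + (1.0 - alpha) * d2)
    return float(np.mean(weights))

alpha_hat, history = run_em(0.9, em_step, log_likelihood)
print(alpha_hat, history[0], history[-1])
```

The recorded `history` should be nondecreasing up to rounding error, whatever starting value in (0, 1) is used.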

To discuss the EM algorithm for mixture density estimation
problems, we assume as in the preceding section that a parametric
family of mixture densities of the form (1.1) is specified and
that a particular
$\Phi^* = (\alpha_1^*, \ldots, \alpha_m^*, \phi_1^*, \ldots, \phi_m^*)$
is the "true" parameter value to be estimated. In the usual way,
we regard this family of densities as being associated with a
statistical population which is a mixture of $m$ component
populations. The EM algorithm for a mixture density estimation
problem associated with this family is derived by first
interpreting the problem as one involving incomplete data and
then obtaining the algorithm from its general formulation given
above. The problem is interpreted as one involving incomplete
data by regarding each unlabeled observation in the sample at
hand as "missing" a label indicating its component population of
origin.

It is instructive to consider the forms which the EM
algorithm might take for mixture density estimation problems
involving samples of the types introduced in the preceding
section. We first illustrate in some detail the derivation of
the function $Q(\Phi|\Phi')$ of the E-step of the algorithm,
assuming for convenience that the sample at hand is a sample
$S_1 = \{x_k\}_{k=1,\ldots,N}$ of Type 1 described in the
preceding section. One can regard $S_1$ as a sample of incomplete
data by considering each $x_k$ to be the "known" part of an
observation $y_k = (x_k, i_k)$, where $i_k$ is an integer between
1 and $m$ indicating a component population of origin of $x_k$.
For $\Phi = (\alpha_1, \ldots, \alpha_m, \phi_1, \ldots, \phi_m) \in \Omega$,
the sample variables $x = (x_1, \ldots, x_N)$ and
$y = (y_1, \ldots, y_N)$ have associated probability density
functions $g(x|\Phi) = \prod_{k=1}^{N} p(x_k|\Phi)$ and
$f(y|\Phi) = \prod_{k=1}^{N} \alpha_{i_k} p_{i_k}(x_k|\phi_{i_k})$,
respectively. Then for
$\Phi' = (\alpha_1', \ldots, \alpha_m', \phi_1', \ldots, \phi_m') \in \Omega$,
the conditional density $k(y|x,\Phi')$ is given by

$$k(y|x,\Phi') = \prod_{k=1}^{N} \frac{\alpha_{i_k}' p_{i_k}(x_k|\phi_{i_k}')}{p(x_k|\Phi')},$$

and the function $Q(\Phi|\Phi')$, which we denote by
$Q_1(\Phi|\Phi')$, is determined to be

$$\begin{aligned}
Q_1(\Phi|\Phi')
&= \sum_{i_1=1}^{m} \cdots \sum_{i_N=1}^{m}
   \Bigl[\sum_{k=1}^{N} \log \alpha_{i_k} p_{i_k}(x_k|\phi_{i_k})\Bigr]
   \prod_{k=1}^{N} \frac{\alpha_{i_k}' p_{i_k}(x_k|\phi_{i_k}')}{p(x_k|\Phi')} \\
&= \sum_{i=1}^{m} \sum_{k=1}^{N}
   \bigl[\log \alpha_i p_i(x_k|\phi_i)\bigr]
   \frac{\alpha_i' p_i(x_k|\phi_i')}{p(x_k|\Phi')} \\
&= \sum_{i=1}^{m} \Bigl[\sum_{k=1}^{N} \frac{\alpha_i' p_i(x_k|\phi_i')}{p(x_k|\Phi')}\Bigr] \log \alpha_i
 + \sum_{i=1}^{m} \sum_{k=1}^{N} \log p_i(x_k|\phi_i)\,
   \frac{\alpha_i' p_i(x_k|\phi_i')}{p(x_k|\Phi')}.
\end{aligned} \tag{4.1}$$
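
The final form of (4.1) is easy to evaluate directly, and the decomposition $L(\Phi) = Q(\Phi|\Phi') - H(\Phi|\Phi')$ can then be checked numerically. The sketch below assumes univariate normal components purely for illustration; the function names and the tuple layout `(alphas, means, variances)` for $\Phi$ are hypothetical. For this sample type, $H(\Phi|\Phi')$ is computed from its definition as the sum over $i$ and $k$ of the posterior weight under $\Phi'$ times the logarithm of the posterior weight under $\Phi$.

```python
import numpy as np

def component_densities(x, means, variances):
    """Matrix with entry [i, k] = p_i(x_k | phi_i) for normal components."""
    x = np.asarray(x, dtype=float)[None, :]
    m = np.asarray(means, dtype=float)[:, None]
    v = np.asarray(variances, dtype=float)[:, None]
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

def posterior_weights(x, phi):
    """Entry [i, k] = alpha_i p_i(x_k | phi_i) / p(x_k | Phi)."""
    alphas, means, variances = phi
    joint = np.asarray(alphas, dtype=float)[:, None] * component_densities(x, means, variances)
    return joint / joint.sum(axis=0, keepdims=True)

def q1(x, phi, phi_prime):
    """Q_1(Phi | Phi') in the final form of (4.1)."""
    alphas, means, variances = phi
    log_joint = np.log(np.asarray(alphas, dtype=float))[:, None] \
        + np.log(component_densities(x, means, variances))
    return float(np.sum(posterior_weights(x, phi_prime) * log_joint))

def h1(x, phi, phi_prime):
    """H(Phi | Phi') = E(log k(y | x, Phi) | x, Phi') for a Type 1 sample."""
    return float(np.sum(posterior_weights(x, phi_prime) * np.log(posterior_weights(x, phi))))

def log_likelihood(x, phi):
    alphas, means, variances = phi
    mixture = np.asarray(alphas, dtype=float) @ component_densities(x, means, variances)
    return float(np.sum(np.log(mixture)))

x = np.array([-1.2, 0.3, 0.9, 2.5, 3.1])
phi = ([0.4, 0.6], [0.0, 2.0], [1.0, 1.5])
phi_prime = ([0.7, 0.3], [0.5, 3.0], [2.0, 1.0])
gap = log_likelihood(x, phi) - (q1(x, phi, phi_prime) - h1(x, phi, phi_prime))
print(gap)
```

Because the posterior weights under $\Phi'$ sum to one over $i$ for each observation, the difference `gap` should vanish up to rounding error for any choice of `phi_prime`.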


For samples $S_2 = \bigcup_{i=1}^{m} \{y_{ik}\}_{k=1,\ldots,J_i}$,
$S_3 = \bigcup_{i=1}^{m} \{z_{ik}\}_{k=1,\ldots,K_i}$, and
$S_4 = \bigcup_{i=0}^{m} \{w_{ik}\}_{k=1,\ldots,M_i}$ of Types
2, 3 and 4, one determines in a similar manner the respective
functions $Q_2(\Phi|\Phi')$, $Q_3(\Phi|\Phi')$, and
$Q_4(\Phi|\Phi')$ for the E-step of the EM algorithm to be

$$Q_2(\Phi|\Phi') = \sum_{i=1}^{m} \sum_{k=1}^{J_i} \log p_i(y_{ik}|\phi_i), \tag{4.2}$$

$$Q_3(\Phi|\Phi') = \sum_{i=1}^{m} K_i \log \alpha_i
 + \sum_{i=1}^{m} \sum_{k=1}^{K_i} \log p_i(z_{ik}|\phi_i), \tag{4.3}$$

$$\begin{aligned}
Q_4(\Phi|\Phi') ={}& \sum_{i=1}^{m} \Bigl[M_i + \sum_{k=1}^{M_0}
   \frac{\alpha_i' p_i(w_{0k}|\phi_i')}{p(w_{0k}|\Phi')}\Bigr] \log \alpha_i \\
 &+ \sum_{i=1}^{m} \Bigl[\sum_{k=1}^{M_i} \log p_i(w_{ik}|\phi_i)
   + \sum_{k=1}^{M_0} \log p_i(w_{0k}|\phi_i)\,
   \frac{\alpha_i' p_i(w_{0k}|\phi_i')}{p(w_{0k}|\Phi')}\Bigr]
\end{aligned} \tag{4.4}$$

for $\Phi = (\alpha_1, \ldots, \alpha_m, \phi_1, \ldots, \phi_m)$ and
$\Phi' = (\alpha_1', \ldots, \alpha_m', \phi_1', \ldots, \phi_m')$ in
$\Omega$. We note that $Q_2(\Phi|\Phi')$ and $Q_3(\Phi|\Phi')$ are
just $L_2(\Phi)$ and (except for an additive constant)
$L_3(\Phi)$ given by (3.2) and (3.3), respectively; and one might
well wonder why they are of interest in this context. By way of
explanation, we observe that if a sample of interest is a
stochastically independent union of smaller samples, then the
function for the E-step of the EM algorithm which is appropriate
for this sample is just the sum of the functions which are
appropriate for the smaller samples. Thus, for example, if
$S = S_1 \cup S_2 \cup S_3$ is a union of independent samples of
Types 1, 2, and 3, then the function for the E-step appropriate
for $S$ is
$Q(\Phi|\Phi') = Q_1(\Phi|\Phi') + Q_2(\Phi|\Phi') + Q_3(\Phi|\Phi')$,
where $Q_1(\Phi|\Phi')$, $Q_2(\Phi|\Phi')$, and $Q_3(\Phi|\Phi')$
are given by (4.1), (4.2), and (4.3), respectively.

Having determined an appropriate function $Q(\Phi|\Phi')$ for
the E-step of the EM algorithm as one or a sum of the functions
$Q_i(\Phi|\Phi')$ defined above, one is likely to find that the
maximization problem of the M-step has a number of attractive
features. It is clear from (4.1), (4.2), (4.3), and (4.4) that
this maximization problem separates into two maximization
problems, the first involving the proportions
$\alpha_1, \ldots, \alpha_m$ alone and the second involving only
the remaining parameters $\phi_1, \ldots, \phi_m$. Since
$\log \alpha_1, \ldots, \log \alpha_m$ appear linearly in each
function $Q_i(\Phi|\Phi')$ for $i \ne 2$, the first maximization
problem has a unique solution if the sample is not strictly of
Type 2; and this solution is easily and explicitly determined
regardless of the functional forms of the component densities
$p_i(x|\phi_i)$. If $\phi_1, \ldots, \phi_m$ are mutually
independent variables, then the second maximization problem
separates further into $m$ component problems, each of which
involves only one of the parameters $\phi_i$. Both these
component problems and the maximization problem for the
proportions alone have the appealing property that they can be
regarded as "weighted" maximum-likelihood estimation problems
involving sums of logarithms weighted by posterior probabilities
that sample observations belong to appropriate component
populations, given the current approximate maximum-likelihood
estimate of $\Phi^*$.

To illustrate these remarks, we consider a sample
$S_1 = \{x_k\}_{k=1,\ldots,N}$ of Type 1 and assume that
$\phi_1, \ldots, \phi_m$ are mutually independent variables. If
$\Phi^c = (\alpha_1^c, \ldots, \alpha_m^c, \phi_1^c, \ldots, \phi_m^c)$
is a current approximate maximizer of the log-likelihood function
$L_1(\Phi)$ given by (3.1), then one easily verifies that the next
approximate maximizer
$\Phi^+ = (\alpha_1^+, \ldots, \alpha_m^+, \phi_1^+, \ldots, \phi_m^+)$
prescribed by the M-step of the EM algorithm satisfies

$$\alpha_i^+ = \frac{1}{N} \sum_{k=1}^{N}
   \frac{\alpha_i^c p_i(x_k|\phi_i^c)}{p(x_k|\Phi^c)}, \tag{4.5}$$

$$\phi_i^+ \in \arg\max_{\phi_i \in \Omega_i} \sum_{k=1}^{N}
   \log p_i(x_k|\phi_i)\,
   \frac{\alpha_i^c p_i(x_k|\phi_i^c)}{p(x_k|\Phi^c)}, \tag{4.6}$$

for $i = 1, \ldots, m$. Note that, as promised, each $\alpha_i^+$
is uniquely and explicitly determined and each $\alpha_i^+$ and
$\phi_i^+$ is obtained as the solution of a weighted maximum-
likelihood estimation problem involving a sum of logarithms
multiplied by weights
$\alpha_i^c p_i(x_k|\phi_i^c) / p(x_k|\Phi^c)$, each of which is
just the posterior probability that $x_k$ originated in the
$i$-th component population, given the current approximate
maximum-likelihood estimate $\Phi^c$.
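
The proportion update (4.5) needs only the matrix of component density values at the current iterate, whatever the functional form of the components. The sketch below makes that explicit; the function name `update_proportions` and the density values in the example are hypothetical.

```python
import numpy as np

def update_proportions(alphas_c, dens_c):
    """Apply (4.5). dens_c[i, k] holds p_i(x_k | phi_i^c).

    Returns the new proportions and the matrix of posterior weights
    alpha_i^c p_i(x_k | phi_i^c) / p(x_k | Phi^c).
    """
    joint = np.asarray(alphas_c, dtype=float)[:, None] * np.asarray(dens_c, dtype=float)
    weights = joint / joint.sum(axis=0, keepdims=True)
    return weights.mean(axis=1), weights

# Made-up density values for m = 3 components and N = 4 observations.
alphas_c = [0.5, 0.3, 0.2]
dens_c = np.array([
    [0.30, 0.05, 0.20, 0.01],
    [0.02, 0.25, 0.10, 0.40],
    [0.10, 0.10, 0.05, 0.05],
])
alphas_plus, weights = update_proportions(alphas_c, dens_c)
print(alphas_plus, alphas_plus.sum())
```

Each column of `weights` sums to one, so the updated proportions are non-negative and sum to one without any explicit constraint handling.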




In addition to prescribing each $\alpha_i^+$ and $\phi_i^+$ as
the solution of a heuristically appealing weighted maximum-
likelihood estimation problem, there are other attractions to
(4.5) and (4.6). For example, (4.5) insures that the next
approximate proportions $\alpha_i^+$ inherit from the current
approximate proportions $\alpha_i^c$ the property of being non-
negative and summing to 1. Furthermore, although there is no
guarantee that the maximization problems (4.6) will have nice
properties in general, it happens that each $\phi_i^+$ is usually
easily (even uniquely and explicitly) determined by (4.6) in most
applications of interest, especially in those applications in
which each component density $p_i(x|\phi_i)$ is one of the common
parametric densities for which ordinary (labeled-sample) maximum-
likelihood estimates of $\phi_i$ are uniquely and explicitly
determined. As an illustration, consider the case in which some
$p_i(x|\phi_i)$ is a multivariate normal density, i.e.,
$p_i(x|\phi_i)$ and $\phi_i$ are given by

$$p_i(x|\phi_i) = \frac{1}{(2\pi)^{n/2} (\det \Sigma_i)^{1/2}}\,
   e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)}, \qquad
   \phi_i = (\mu_i, \Sigma_i), \tag{4.7}$$

where $\mu_i \in \mathbb{R}^n$ and $\Sigma_i$ is a positive-
definite symmetric $n \times n$ matrix. For a given $\Phi^c$, the
unique solution $\phi_i^+ = (\mu_i^+, \Sigma_i^+)$ of (4.6) is
given by

$$\mu_i^+ = \Bigl\{\sum_{k=1}^{N} x_k
   \frac{\alpha_i^c p_i(x_k|\phi_i^c)}{p(x_k|\Phi^c)}\Bigr\} \Big/
   \Bigl\{\sum_{k=1}^{N}
   \frac{\alpha_i^c p_i(x_k|\phi_i^c)}{p(x_k|\Phi^c)}\Bigr\}, \tag{4.8}$$

$$\Sigma_i^+ = \Bigl\{\sum_{k=1}^{N} (x_k - \mu_i^+)(x_k - \mu_i^+)^T
   \frac{\alpha_i^c p_i(x_k|\phi_i^c)}{p(x_k|\Phi^c)}\Bigr\} \Big/
   \Bigl\{\sum_{k=1}^{N}
   \frac{\alpha_i^c p_i(x_k|\phi_i^c)}{p(x_k|\Phi^c)}\Bigr\}. \tag{4.9}$$

(The factors $\alpha_i^c$ have been left in the numerators and
denominators of these expressions for aesthetic reasons only.)
Note that $\Sigma_i^+$ is positive-definite symmetric with
probability 1 if $N > n$.
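
Expressions (4.8) and (4.9) are a weighted sample mean and a weighted sample covariance. The sketch below, with the hypothetical name `update_normal_component`, computes them for one component from an array of observations and that component's weights, and checks that equal weights reproduce the ordinary labeled-sample maximum-likelihood estimates.

```python
import numpy as np

def update_normal_component(x, w_i):
    """Apply (4.8) and (4.9) for one component.

    x   : (N, n) array of observations x_k.
    w_i : (N,) array of weights alpha_i^c p_i(x_k|phi_i^c) / p(x_k|Phi^c).
    """
    x = np.asarray(x, dtype=float)
    w_i = np.asarray(w_i, dtype=float)
    total = w_i.sum()
    mu = (w_i[:, None] * x).sum(axis=0) / total           # (4.8)
    centered = x - mu
    sigma = (w_i[:, None] * centered).T @ centered / total  # (4.9)
    return mu, sigma

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 3))   # N = 50 observations in R^3 (illustrative data)
w = rng.random(50)             # stand-in posterior weights in (0, 1)
mu_plus, sigma_plus = update_normal_component(x, w)

# With equal weights the update is the usual sample mean and the
# maximum-likelihood (divide-by-N) sample covariance.
mu_equal, sigma_equal = update_normal_component(x, np.ones(50))
print(mu_plus, np.linalg.eigvalsh(sigma_plus))
```

Only the relative sizes of the weights matter here, which is the computational counterpart of the remark that the factors $\alpha_i^c$ cancel between numerator and denominator.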


What convergence properties hold for a sequence of iterates
generated by applying the EM algorithm to a mixture density
estimation problem? If nothing in particular is said about the
parametric family of interest, then the properties which can be
specified are essentially those obtained by specializing the
convergence results associated with the EM algorithm for general
incomplete data problems. The convergence results below are
formulated so that they are valid for the EM algorithm in
general. Points relating to these results which are of
particular interest in the mixture density context are made in
remarks following the theorems.

The first theorem is a global convergence result for
sequences generated by the EM algorithm. It essentially
summarizes the results of Wu [134] for the general EM algorithm
and of Redner [109] for the more specialized case of the EM
algorithm applied to a mixture of densities from exponential
families. Similar but weaker results have been formulated for
the general EM algorithm by Boyles [17]. Statements (i), (ii),
and (iii) of the theorem are valid for any sequence and are
stated here as a convenience because of their usefulness in
applications. Statements (iv), (v), and (vi) are based on the
fact reviewed earlier and reiterated in the statement of the
theorem that the log-likelihood function increases monotonically
on a sequence generated by the EM algorithm. Through the use of
this fact, the theorem can be related to general results in
optimization theory such as the convergence theorems of Zangwill
[141; pages 91, 128, and 232] concerning point-to-set maps which
increase an objective function. One such general result was used
explicitly by Wu [134] in his study of the EM algorithm.

Theorem 4.1: Suppose that for some $\Phi^{(0)} \in \Omega$,
$\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ is a sequence in $\Omega$
generated by the EM algorithm, i.e., a sequence in $\Omega$
satisfying

$$\Phi^{(j+1)} \in \arg\max_{\Phi \in \Omega} Q(\Phi|\Phi^{(j)}), \qquad j = 0, 1, 2, \ldots,$$

where $Q(\Phi|\Phi')$ is the function determined in the E-step of
the EM algorithm. Then the log-likelihood function $L(\Phi)$
increases monotonically on $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ to a
(possibly infinite) limit $L^*$. Furthermore, denoting the set of
limit points of $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ in $\Omega$ by
$\mathcal{L}$, one has the following:

(i) $\mathcal{L}$ is a closed set in $\Omega$.

(ii) If $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ is contained in a compact
subset of $\Omega$, then $\mathcal{L}$ is compact.

(iii) If $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ is contained in a compact
subset of $\Omega$ and
$\lim_{j \to \infty} \|\Phi^{(j+1)} - \Phi^{(j)}\| = 0$ for a norm
$\|\cdot\|$ on $\Omega$, then $\mathcal{L}$ is connected as well as
compact.

(iv) If $L(\Phi)$ is continuous in $\Omega$ and
$\mathcal{L} \ne \emptyset$, then $L^*$ is finite and
$L(\hat\Phi) = L^*$ for $\hat\Phi \in \mathcal{L}$.

(v) If $Q(\Phi|\Phi')$ and $H(\Phi|\Phi') = Q(\Phi|\Phi') - L(\Phi)$
are continuous in $\Phi$ and $\Phi'$ in $\Omega$, then each
$\hat\Phi \in \mathcal{L}$ satisfies
$\hat\Phi \in \arg\max_{\Phi \in \Omega} Q(\Phi|\hat\Phi)$.

(vi) If $Q(\Phi|\Phi')$ and $H(\Phi|\Phi')$ are continuous in
$\Phi$ and $\Phi'$ in $\Omega$ and differentiable in $\Phi$ at
$\Phi = \Phi' = \hat\Phi \in \mathcal{L}$, then $L(\Phi)$ is
differentiable at $\Phi = \hat\Phi$ and the likelihood equations
$\nabla_\Phi L(\Phi) = 0$ are satisfied by $\Phi = \hat\Phi$.

Proof: The monotonicity of $L(\Phi)$ on $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ has already been established; the existence of a (possibly infinite) limit $L^*$ follows. Statement (i) holds since closedness is a general property of sets of limit points. To obtain (ii), note that if $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ is contained in a compact subset of $\Omega$, then $\mathcal{L}$ is a closed subset of this compact subset and, hence, is compact. To prove (iii), suppose that $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ is contained in a compact subset of $\Omega$, that $\lim_{j\to\infty} \|\Phi^{(j+1)} - \Phi^{(j)}\| = 0$, and that $\mathcal{L}$ is not connected. Since $\mathcal{L}$ is compact, there is a minimal distance between distinct components of $\mathcal{L}$; and the fact that $\lim_{j\to\infty} \|\Phi^{(j+1)} - \Phi^{(j)}\| = 0$ implies that there is an infinite subsequence of $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ whose members are bounded away from $\mathcal{L}$. This subsequence lies in a compact set, and so it has limit points. Since these limit points cannot be in $\mathcal{L}$, one has a contradiction.

Statement (iv) follows immediately from the monotonicity of $L(\Phi)$ on $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$. To prove (v), suppose that $Q(\Phi \mid \Phi')$ and $H(\Phi \mid \Phi')$ are continuous in $\Phi$ and $\Phi'$ in $\Omega$ and that one can find some $\hat\Phi \in \mathcal{L}$ and $\Phi \in \Omega$ for which $Q(\Phi \mid \hat\Phi) > Q(\hat\Phi \mid \hat\Phi)$. Then for every $j$,

\[ L(\Phi^{(j+1)}) = Q(\Phi^{(j+1)} \mid \Phi^{(j)}) - H(\Phi^{(j+1)} \mid \Phi^{(j)}) \geq Q(\Phi \mid \Phi^{(j)}) - H(\Phi^{(j)} \mid \Phi^{(j)}) \]

by the M-step determination of $\Phi^{(j+1)}$ and Jensen's inequality. Since $Q(\Phi \mid \Phi')$ and $H(\Phi \mid \Phi')$ are continuous, it follows by taking limits along a subsequence converging to $\hat\Phi$ that

\[ L^* \geq Q(\Phi \mid \hat\Phi) - H(\hat\Phi \mid \hat\Phi) > Q(\hat\Phi \mid \hat\Phi) - H(\hat\Phi \mid \hat\Phi) = L(\hat\Phi) = L^*, \]

which is a contradiction. To establish (vi), suppose that $Q(\Phi \mid \Phi')$ and $H(\Phi \mid \Phi')$ are continuous in $\Phi$ and $\Phi'$ in $\Omega$ and differentiable in $\Phi$ at $\Phi = \Phi' = \hat\Phi \in \mathcal{L}$. Then $L(\Phi) = Q(\Phi \mid \hat\Phi) - H(\Phi \mid \hat\Phi)$ is differentiable at $\Phi = \hat\Phi$; and, since $\hat\Phi \in \arg\max_{\Phi \in \Omega} Q(\Phi \mid \hat\Phi)$ by (v) and $\hat\Phi \in \arg\max_{\Phi \in \Omega} H(\Phi \mid \hat\Phi)$ by Jensen's inequality, one has $\nabla_\Phi L(\hat\Phi) = 0$. This completes the proof.

Statement (iii) of Theorem 4.1 has precedent in such results as Theorem 28.1 of Ostrowski [96]. It is usually satisfied in practice, especially in the mixture density context. Indeed, it often happens that each $\Phi^+$ is uniquely determined by the EM algorithm as a function of $\Phi^c$ which is continuous in $\Omega$. For example, one sees that $\alpha_1^+, \ldots, \alpha_m^+$ are determined in this way by (4.5) whenever each $p_i(x \mid \phi_i)$ depends continuously on $\phi_i$. In addition, each $\phi_i^+$ is likely to be determined in this way by (4.6) whenever each $p_i(x \mid \phi_i)$ is one of the common parametric densities
for  which  ordinary  maximum-likelihood  estimates  are  determined  as 
a continuous  function  of  p ^ ; see,  for  example,  (4.8)  and  (4.9). 

If $\Phi^+$ is determined in this way from $\Phi^c$ and if the conditions of (v) are also satisfied, then each $\hat\Phi \in \mathcal{L}$ is a fixed point of a continuous function. It follows that if in addition $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ is contained in a compact subset of $\Omega$, then the elements of a "tail" sequence $\{\Phi^{(j)}\}_{j=J,J+1,\ldots}$ can all be made to lie arbitrarily close to the compact set $\mathcal{L}$ by taking $J$ sufficiently large and, hence, $\lim_{j\to\infty} \|\Phi^{(j+1)} - \Phi^{(j)}\| = 0$ by the uniform continuity near $\mathcal{L}$ of the function determining $\Phi^+$ from $\Phi^c$.

It is useful to expand a little on the interpretation of statement (vi) in the mixture density context. Assuming that each $p_i(x \mid \phi_i)$ is differentiable with respect to $\phi_i$, that the parameters $\phi_i$ are unconstrained in $\Omega_i$ and mutually independent, and, for convenience, that the sample of interest is of Type 1, one can reasonably interpret the likelihood equations $\nabla_\Phi L(\Phi) = 0$ in the sense of (3.9) at a point $\hat\Phi = (\hat\alpha_1, \ldots, \hat\alpha_m, \hat\phi_1, \ldots, \hat\phi_m) \in \mathcal{L}$ which is such that each $\hat\alpha_i$ is positive. Now it is certainly possible for some $\hat\alpha_i$ to be zero for $\hat\Phi \in \mathcal{L}$, in which case (3.9) might not be valid. Fortunately, (3.5) and (3.8) provide a better interpretation than (3.9) of the likelihood equations in the mixture density context which is valid whether each $\hat\alpha_i$ is positive or not. Indeed, if the conditions of (v) hold, then it follows from (4.5) that the equations (3.8) are satisfied on $\mathcal{L}$. Thus in the mixture density context under the present assumptions, (vi) should be replaced with the following:


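(vi)' If $Q(\Phi \mid \Phi')$ and $H(\Phi \mid \Phi')$ are continuous in $\Phi$ and $\Phi'$ in $\Omega$ and differentiable in $\phi_1, \ldots, \phi_m$ at $\Phi = \Phi' = \hat\Phi \in \mathcal{L}$, then $L(\Phi)$ is differentiable in $\phi_1, \ldots, \phi_m$ at $\Phi = \hat\Phi$ and the likelihood equations (3.5) and (3.8) are satisfied by $\hat\Phi$.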

To illustrate the application of Theorem 4.1, we consider the problem of estimating the proportions in a mixture under the assumption that each component density $p_i(x \mid \phi_i)$ is known (and denoted for the present purposes by $p_i(x)$ for simplicity). The theorem below is a global convergence result for the EM algorithm applied to this problem. For convenience in presenting the theorem, it is assumed that the sample at hand is a sample $S_1 = \{x_k\}_{k=1,\ldots,N}$ of Type 1. Similar results hold for other cases in which the sample at hand is one or a union of the types considered in the preceding section. For this problem, one has simply $\Phi = (\alpha_1, \ldots, \alpha_m)$; and it is, of course, always understood that $\sum_{i=1}^m \alpha_i = 1$ and $\alpha_i \geq 0$, $i = 1, \ldots, m$, for all such $\Phi$.

We remark that the condition of the theorem on the matrix of second derivatives of $L_1(\Phi)$ is quite reasonable. This matrix is always defined and negative semi-definite whenever $p(x_k \mid \Phi) \neq 0$ for $k = 1, \ldots, N$; and if $p_1(x), \ldots, p_m(x)$ are linearly independent non-vanishing functions on the support of the underlying measure $\mu$ on $R^n$ appropriate for $p$, then with probability 1 it is defined and negative definite for all $\Phi$ whenever $N$ is sufficiently large.


Theorem 4.2: Suppose that the matrix of second derivatives of $L_1(\Phi)$ is defined and negative definite for all $\Phi$. Then there is a unique maximum-likelihood estimate; and for any $\Phi^{(0)} = (\alpha_1^{(0)}, \ldots, \alpha_m^{(0)})$ with $\alpha_i^{(0)} > 0$ for $i = 1, \ldots, m$, the sequence $\{\Phi^{(j)} = (\alpha_1^{(j)}, \ldots, \alpha_m^{(j)})\}_{j=0,1,2,\ldots}$ generated by the EM algorithm, i.e., determined inductively by

\[ \alpha_i^{(j+1)} = \frac{1}{N} \sum_{k=1}^{N} \frac{\alpha_i^{(j)} p_i(x_k)}{p(x_k \mid \Phi^{(j)})}, \qquad i = 1, \ldots, m, \]

converges to the maximum-likelihood estimate.

Proof: It follows from Theorem 4.1 and the subsequent remarks that the set of limit points of $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ is a compact, connected subset of the simplex of proportion vectors $\Phi$ on which the likelihood equations (3.8) are satisfied. Since the matrix of second derivatives of $L_1(\Phi)$ is negative definite, $L_1(\Phi)$ is strictly concave. It follows that there is a unique maximum-likelihood estimate and, furthermore, that the likelihood equations (3.8) have at most one solution on the interior of each face of the proportion simplex. Consequently, each component of the set of solutions of the likelihood equations consists of a single point; and $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ must converge to one such point. But if $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ is convergent, then its limit must be the maximum-likelihood estimate by Theorem 2 of Peters and Coberly [100] or Theorem 1 of Peters and Walker [102].
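The proportion update of the theorem above is simple enough to sketch directly. The Python fragment below is an illustration only and not part of the original development: the function names, the fixed ten-point sample, and the choice of two unit-variance normal components are assumptions made for the example. It records the log-likelihood at each iterate so that the monotone increase asserted in Theorem 4.1 can be observed.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution (stands in for a known component p_i)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_proportions(samples, component_pdfs, alpha0, iterations=200):
    """Iterate alpha_i <- (1/N) sum_k alpha_i p_i(x_k) / p(x_k | alpha).

    Returns the final proportions and the log-likelihood at every iterate.
    """
    # The component densities are known, so p_i(x_k) is computed once.
    dens = [[pdf(x) for pdf in component_pdfs] for x in samples]
    alpha = list(alpha0)
    history = []
    for _ in range(iterations):
        # Mixture density p(x_k | alpha) at the current iterate.
        mix = [sum(a * d for a, d in zip(alpha, row)) for row in dens]
        history.append(sum(math.log(p) for p in mix))
        alpha = [
            a * sum(row[i] / p for row, p in zip(dens, mix)) / len(samples)
            for i, a in enumerate(alpha)
        ]
    return alpha, history


# Hypothetical data: a fixed sample and two known components N(0,1), N(3,1).
samples = [-2.1, -1.7, -0.9, -0.3, 0.2, 1.8, 2.4, 2.9, 3.3, 4.1]
pdfs = [lambda x: normal_pdf(x, 0.0, 1.0), lambda x: normal_pdf(x, 3.0, 1.0)]
estimate, loglik = em_proportions(samples, pdfs, [0.5, 0.5])
```

Starting the iteration from a different interior point, for instance `[0.9, 0.1]`, should lead to the same limit when the second-derivative condition of the theorem holds, as it does for this sample.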




Despite  the  usefulness  of  Theorem  4.1  in  characterizing  the 
set  of  limit  points  of  an  iteration  sequence  generated  by  the  EM 
algorithm,  it  leaves  unanswered  the  questions  of  whether  such  a 
sequence  converges  at  all  and,  if  it  does,  whether  it  converges 
to  a maximum-likelihood  estimate.  In  an  attempt  to  provide 
reasonable  sufficient  conditions  under  which  the  answer  to  these 
questions  is  "yes",  we  offer  the  local  convergence  theorem  below. 


Theorem 4.3: Suppose that Conditions 1 through 4 of Section 3 are satisfied in $\Omega$, and let $\Omega'$ be a compact subset of $\Omega$ which contains $\Phi^*$ in its interior and which is such that $p(x \mid \Phi) = p(x \mid \Phi^*)$ almost everywhere in $x$ for $\Phi \in \Omega'$ only if $\Phi = \Phi^*$. Suppose further that with probability 1, the function $Q(\Phi \mid \Phi')$ of the E-step of the EM algorithm is continuous in $\Phi$ and $\Phi'$ in $\Omega'$ and both $Q(\Phi \mid \Phi')$ and the log-likelihood function $L(\Phi)$ are differentiable in $\Phi$ for $\Phi$ and $\Phi'$ in $\Omega'$ whenever $N$ is sufficiently large. Finally, for $\Phi^{(0)}$ in $\Omega'$, denote by $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ a sequence generated by the EM algorithm in $\Omega'$, i.e., a sequence in $\Omega'$ satisfying

\[ \Phi^{(j+1)} \in \arg\max_{\Phi \in \Omega'} Q(\Phi \mid \Phi^{(j)}), \qquad j = 0,1,2,\ldots. \]

Then with probability 1, whenever $N$ is sufficiently large, the unique strongly consistent maximum-likelihood estimate $\Phi^N$ is well-defined in $\Omega'$ and $\Phi^N = \lim_{j\to\infty} \Phi^{(j)}$ whenever $\Phi^{(0)}$ is sufficiently near $\Phi^N$.

Proof: It follows from Theorems 3.1 and 3.2 that with probability 1, $N$ can be taken sufficiently large that the unique strongly consistent maximum-likelihood estimate $\Phi^N$ is well-defined, lies in the interior of $\Omega'$, and is the unique maximizer of $L(\Phi)$ in $\Omega'$. Also with probability 1, we can assume that $N$ is sufficiently large that $Q(\Phi \mid \Phi')$ is continuous in $\Phi$ and $\Phi'$ in $\Omega'$ and $Q(\Phi \mid \Phi')$ and $L(\Phi)$ are differentiable in $\Phi$ for $\Phi$ and $\Phi'$ in $\Omega'$. Since $L(\Phi)$ is continuous, one can find a neighborhood $\Omega''$ of $\Phi^N$ of the form

\[ \Omega'' = \{\Phi \in \Omega' : L(\Phi) \geq L(\Phi^N) - \epsilon\} \]

for some $\epsilon > 0$ which lies in the interior of $\Omega'$ and which is such that $\Phi^N$ is the only solution of the likelihood equations contained in it. If $\Phi^{(0)}$ lies in $\Omega''$, then $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ must also lie in $\Omega''$ since $L(\Phi)$ is monotone increasing on $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$. It follows that each limit point of $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ lies in $\Omega''$ and, by statement (vi) of Theorem 4.1, also satisfies the likelihood equations. Since $\Phi^N$ is the only solution of the likelihood equations in $\Omega''$, one concludes that $\Phi^N = \lim_{j\to\infty} \Phi^{(j)}$.


As in the case of Theorem 4.1, Theorem 4.3 is stated so that it is valid for the EM algorithm in general. It should be noted, however, that Theorem 4.3 makes heavy use of Theorems 3.1 and 3.2 as well as Theorem 4.1; and so for mixture density estimation problems, it pertains as it stands, strictly speaking, to the case to which Theorems 3.1 and 3.2 apply, namely that in which the sample at hand is of Type 1 and $L(\Phi) = L_1(\Phi)$ given by (3.1) and $Q(\Phi \mid \Phi') = Q_1(\Phi \mid \Phi')$ given by (4.1). Of course, Theorems 3.1 and 3.2 and, therefore, Theorem 4.3 can be modified to treat mixture density estimation problems involving samples of other types.

5. The EM Algorithm for Mixtures of Densities from Exponential Families

Almost  all  mixture  density  estimation  problems  which  have 
been  studied  in  the  literature  involve  mixture  densities  whose 
component  densities  are  members  of  exponential  families.  As  it 
happens,  the  EM  algorithm  is  especially  easy  to  implement  on 
problems  involving  densities  of  this  type.  Indeed,  in  an 
application of the EM algorithm to such a problem, each successive approximate maximum-likelihood estimate $\Phi^+$ is uniquely and explicitly determined from its predecessor $\Phi^c$, almost always in a continuous manner. Furthermore, a sequence of iterates produced by the EM algorithm on such a problem is likely to have relatively nice convergence properties.

In  this  section,  we  first  determine  the  special  form  which 
the  EM  algorithm  takes  for  mixtures  of  densities  from  exponential 
families.  We  then  look  into  the  desirable  properties  of  the 
algorithm  and  sequences  generated  by  it  which  are  apparent  from 
this  form.  Finally,  we  discuss  several  specific  examples  of  the 
EM  algorithm  for  component  densities  from  exponential  families 
which  are  commonly  of  interest. 

A very brief discussion of exponential families of densities is in order. For an elaboration on the topics touched on here, the reader is referred to the book of Barndorff-Nielsen [6]. A parametric family of densities $q(x \mid \theta)$, $\theta \in \Theta \subseteq R^k$, on $R^n$ is said to be an exponential family if its members have the form

\[ q(x \mid \theta) = a(\theta)^{-1} b(x) e^{\theta^T t(x)}, \qquad x \in R^n, \tag{5.1} \]

where $b : R^n \to R^1$, $t : R^n \to R^k$, and $a(\theta)$ is given by

\[ a(\theta) = \int_{R^n} b(x) e^{\theta^T t(x)} \, d\mu \]

for an appropriate underlying measure $\mu$ on $R^n$. It is, of course, assumed that $b(x) \geq 0$ for all $x \in R^n$ and that $a(\theta) < \infty$ for $\theta \in \Theta$. Note that every member of an exponential family has the same support in $R^n$, namely that of the function $b(x)$.


The representation (5.1) of the members of an exponential family, in which the parameter $\theta$ appears linearly in the argument of the exponential function, is called the "natural" parametrization; and $\theta$ is called the "natural" parameter. If the set $\Theta$ is open and convex and if the component functions of $t(x)$ together with the function which is identically 1 on $R^n$ are linearly independent functions on the intersection of the supports of $b(x)$ and $\mu$, then there is another parametrization of the members of the family, called the "expectation" or "mean value" parametrization, in terms of the "expectation" parameter

\[ \phi = E(t(X) \mid \theta) = \int_{R^n} t(x) q(x \mid \theta) \, d\mu. \]

Indeed, under these conditions on $\Theta$ and $t(x)$, one can show that

\[ [E(t(X) \mid \theta') - E(t(X) \mid \theta)]^T (\theta' - \theta) > 0 \]

whenever $\theta' \neq \theta$; and it follows that the assignment $\theta \to \phi = E(t(X) \mid \theta)$ is one-to-one and onto from $\Theta$ to an open set $\Omega \subset R^k$. In fact, the correspondence $\theta \leftrightarrow \phi = E(t(X) \mid \theta)$ is a both-ways continuously differentiable mapping between $\Theta$ and $\Omega$. (See Barndorff-Nielsen [6; p. 121].) So under these conditions on $\Theta$ and $t(x)$, one can represent the members of the family as

\[ p(x \mid \phi) = q(x \mid \theta(\phi)) = a(\phi)^{-1} b(x) e^{\theta(\phi)^T t(x)}, \qquad x \in R^n, \tag{5.2} \]

for $\phi \in \Omega$, where $\theta(\phi)$ satisfies $\phi = E(t(X) \mid \theta(\phi))$ and $a(\theta(\phi))$ is written as $a(\phi)$ for convenience. Note that $p(x \mid \phi)$ is continuously differentiable in $\phi$, since $q(x \mid \theta)$ is continuously differentiable in $\theta$ and $\theta(\phi)$ is continuously differentiable in $\phi$.
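As a concrete instance of the two parametrizations, consider the Poisson family with respect to counting measure on the nonnegative integers, for which one may take $b(x) = 1/x!$, $t(x) = x$, natural parameter $\theta = \log \phi$, and $a(\theta) = \exp(e^\theta)$. The following Python sketch is an illustration added here under those assumptions (the function names and the truncation of the infinite sum are choices made for the example); it checks numerically that the expectation parameter equals $e^\theta$ and that the map $\theta \to \phi$ is invertible.

```python
import math


def poisson_natural(x, theta):
    """q(x | theta) = a(theta)^{-1} b(x) exp(theta * t(x)) for the Poisson family.

    Here b(x) = 1/x!, t(x) = x and a(theta) = exp(exp(theta)); the value is
    assembled in log space to avoid overflow in x!.
    """
    return math.exp(theta * x - math.lgamma(x + 1) - math.exp(theta))


def expectation_parameter(theta, terms=200):
    """phi = E[t(X) | theta], approximated by a truncated sum over the support."""
    return sum(x * poisson_natural(x, theta) for x in range(terms))


def natural_from_expectation(phi):
    """Inverse map theta(phi); for the Poisson family this is log(phi)."""
    return math.log(phi)


theta = 0.7
phi = expectation_parameter(theta)            # numerically close to exp(0.7)
theta_back = natural_from_expectation(phi)    # recovers theta up to rounding
```

The strict monotonicity inequality stated above reduces in this scalar case to the observation that $e^\theta$ is increasing in $\theta$.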


Now suppose that a parametric family of mixture densities of the form (1.1) is given, with $\Phi^* = (\alpha_1^*, \ldots, \alpha_m^*, \phi_1^*, \ldots, \phi_m^*)$ the "true" parameter value to be estimated; and suppose that each component density $p_i(x \mid \phi_i)$ is a member of an exponential family. Specifically, we assume that each $p_i(x \mid \phi_i)$ has the "expectation" parametrization for $\phi_i \in \Omega_i \subset R^{n_i}$ given by

\[ p_i(x \mid \phi_i) = a_i(\phi_i)^{-1} b_i(x) e^{\theta_i(\phi_i)^T t_i(x)}, \qquad x \in R^n, \]

where $b_i : R^n \to R^1$, $t_i : R^n \to R^{n_i}$, and $a_i(\phi_i)$ is given by

\[ a_i(\phi_i) = \int_{R^n} b_i(x) e^{\theta_i(\phi_i)^T t_i(x)} \, d\mu \]

and $\theta_i : \Omega_i \to R^{n_i}$. Here, $\mu$ is a measure on $R^n$ appropriate for the mixture density $p(x \mid \Phi)$; and it is understood that $b_i(x) \geq 0$ for $x \in R^n$ and that $a_i(\phi_i) < \infty$ for $\phi_i \in \Omega_i$. It is also assumed that the component functions of $t_i(x)$ together with the function which is identically 1 on $R^n$ are linearly independent on the intersection of the supports of $b_i(x)$ and $\mu$ and that the assignment $\phi_i \to \theta_i(\phi_i)$ maps $\Omega_i$ to a convex open set $\Theta_i \subset R^{n_i}$ in a one-to-one, onto way so that $\theta_i(\phi_i)$ is the unique solution in $\Theta_i$ of $\phi_i = E(t_i(X) \mid \theta_i)$ for $\phi_i \in \Omega_i$. These assumptions allow us to make use of the "natural" parametrization of the family to which $p_i(x \mid \phi_i)$ belongs using the "natural" parameter $\theta_i = \theta_i(\phi_i)$.


To investigate the special form and properties of the EM algorithm for the given family of mixture densities, we assume that $\phi_1, \ldots, \phi_m$ are mutually independent variables and consider for convenience a sample $S_1 = \{x_k\}_{k=1,\ldots,N}$ of Type 1. (A discussion similar to the following is valid mutatis mutandis for samples of other types.) If $\Phi^c = (\alpha_1^c, \ldots, \alpha_m^c, \phi_1^c, \ldots, \phi_m^c)$ is a current approximate maximizer of the log-likelihood function $L_1(\Phi)$ given by (3.1), then the next approximate maximizer $\Phi^+ = (\alpha_1^+, \ldots, \alpha_m^+, \phi_1^+, \ldots, \phi_m^+)$ prescribed by the M-step of the EM algorithm satisfies (4.5) and (4.6). For $i = 1, \ldots, m$, what $\phi_i^+$ satisfy (4.6)? If one replaces each $p_i(x_k \mid \phi_i)$ in the sum in (4.6) by its expression in the "natural" parameter $\theta_i$, differentiates with respect to $\theta_i$, equates the sum of derivatives to zero, and finally restores the "expectation" parametrization, then one sees that the unique $\phi_i^+$ which satisfies (4.6) is given explicitly by

\[ \phi_i^+ = \left\{ \sum_{k=1}^{N} t_i(x_k) \frac{\alpha_i^c p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} \right\} \Big/ \left\{ \sum_{k=1}^{N} \frac{\alpha_i^c p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} \right\}. \tag{5.3} \]

(As in the case of (4.8) and (4.9), the factors $\alpha_i^c$ are left in the numerator and denominator for aesthetic reasons only.)
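One step of the iteration defined by (4.5) and (5.3) can be written in a few lines. The sketch below is illustrative only: it assumes, for simplicity, that all components come from the same one-parameter family with a scalar statistic `t`, and the names `em_step` and `exponential_density`, as well as the sample values, are hypothetical choices made for the example. The exponential density with mean `phi` and `t(x) = x` is used as the component family.

```python
import math


def exponential_density(x, phi):
    """Exponential density with mean phi (expectation parametrization, t(x) = x)."""
    return math.exp(-x / phi) / phi


def em_step(samples, alphas, phis, density, t):
    """One EM step: (4.5) for the proportions, (5.3) for the expectation parameters."""
    n = len(samples)
    m = len(alphas)
    # weights[k][i] = alpha_i p_i(x_k | phi_i) / p(x_k | Phi)
    weights = []
    for x in samples:
        terms = [a * density(x, phi) for a, phi in zip(alphas, phis)]
        total = sum(terms)
        weights.append([term / total for term in terms])
    # (4.5): the new proportion is the average weight of component i.
    new_alphas = [sum(w[i] for w in weights) / n for i in range(m)]
    # (5.3): the new phi_i is a weighted average of t(x_k).
    new_phis = [
        sum(w[i] * t(x) for w, x in zip(weights, samples)) / sum(w[i] for w in weights)
        for i in range(m)
    ]
    return new_alphas, new_phis


samples = [0.2, 0.5, 0.9, 1.4, 4.0, 5.5, 7.1, 9.3]
alphas, phis = em_step(samples, [0.5, 0.5], [1.0, 5.0],
                       exponential_density, lambda x: x)
```

Because each new `phi` is a weighted average of the values `t(x_k)` with positive weights, it necessarily lies between the smallest and largest of those values.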


Not only are (4.5) and (5.3) easily evaluated and heuristically appealing formulas for determining $\Phi^+$ from $\Phi^c$, they also provide the key to a global convergence analysis of iteration sequences generated by the EM algorithm in the case at hand which goes beyond Theorem 4.1. Theorem 5.1 below summarizes such an analysis. In order to make the theorem complete and self-contained, some of the general conclusions of Theorem 4.1 are repeated in its statement.


Theorem 5.1: Suppose that $\{\Phi^{(j)} = (\alpha_1^{(j)}, \ldots, \alpha_m^{(j)}, \phi_1^{(j)}, \ldots, \phi_m^{(j)})\}_{j=0,1,2,\ldots}$ is a sequence in $\Omega$ generated by the EM iteration (4.5) and (5.3). Then $L_1(\Phi)$ increases monotonically on $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ to a (possibly infinite) limit $L^*$. Furthermore, for each $i$, $\{\phi_i^{(j)}\}_{j=1,2,\ldots}$ is contained in the convex hull of $\{t_i(x_k)\}_{k=1,\ldots,N}$. Consequently, the set $\bar{\mathcal{L}}$ of all limit points of $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ is compact; and the likelihood equations (3.5) and (3.8) are satisfied on $\mathcal{L} = \bar{\mathcal{L}} \cap \Omega$. If $\mathcal{L} \neq \emptyset$, then $L^*$ is finite and each $\hat\Phi \in \mathcal{L}$ satisfies $L_1(\hat\Phi) = L^*$ and is a fixed point of the EM iteration. Finally, if $\bar{\mathcal{L}} \subset \Omega$, then $\mathcal{L}$ is connected as well as compact.

Remark: If the convex hull of $\{t_i(x_k)\}_{k=1,\ldots,N}$ is contained in $\Omega_i$ for each $i$, then $\mathcal{L} = \bar{\mathcal{L}} \subset \Omega$ and all of the conditional conclusions of Theorem 5.1 hold. The convex hull of $\{t_i(x_k)\}_{k=1,\ldots,N}$ is indeed contained in $\Omega_i$ for each $i$ in many (but not all) applications. (See the examples at the end of this section.)

Proof: One sees from (5.3) that for each $i$, $\phi_i^+$ is always a convex combination of the values $\{t_i(x_k)\}_{k=1,\ldots,N}$, and it follows that $\{\phi_i^{(j)}\}_{j=1,2,\ldots}$ is contained in the convex hull of $\{t_i(x_k)\}_{k=1,\ldots,N}$ for each $i$. Since these convex hulls are compact sets, one concludes that $\bar{\mathcal{L}}$ is compact.

Now each density $p_i(x \mid \phi_i)$ is continuously differentiable in $\phi_i$ on $\Omega_i$, and so it is clear from (3.1) and (4.1) that $L_1(\Phi)$ and $Q_1(\Phi \mid \Phi')$ are continuous in $\Phi$ and $\Phi'$ and differentiable in $\phi_1, \ldots, \phi_m$ in $\Omega$. Furthermore, it is apparent from (4.5) and (5.3) that $\Phi^+$ depends continuously on $\Phi^c$, and one sees from the discussion following Theorem 4.1 that $\lim_{j\to\infty} \|\Phi^{(j+1)} - \Phi^{(j)}\| = 0$ if $\mathcal{L} = \bar{\mathcal{L}} \subset \Omega$. In light of these points, one verifies the remaining conclusions of Theorem 5.1 via a straightforward application of Theorem 4.1 (including statement (vi)'); and the proof is complete.

One  can  also  exploit  (4.5)  and  (5.3)  to  obtain  a local 
convergence  result  which  goes  beyond  Theorem  4.3  for  mixture 
density  estimation  problems  of  the  type  now  under  consideration. 
Theorem  5.2  and  its  proof  below  provide  not  only  a stronger  local 
convergence  statement  than  Theorem  4.3  for  a sequence  of  iterates 
produced  by  the  EM  algorithm  but  also  a means  of  both  quantifying 
the  speed  of  convergence  of  the  sequence  and  gaining  insight  into 
properties  of  the  mixture  density  which  affect  the  speed  of 
convergence.  This  theorem  is  essentially  the  generalization  of 
Redner  [108]  of  the  local  convergence  results  of  Peters  and 
Walker  [101]  for  mixtures  of  multivariate  normal  densities,  and 
its  proof  closely  parallels  the  proofs  of  those  results. 

Theorem 5.2: Suppose that the Fisher information matrix $I(\Phi)$ given by (2.^.1) is positive-definite at $\Phi^*$ and that $\Phi^* = (\alpha_1^*, \ldots, \alpha_m^*, \phi_1^*, \ldots, \phi_m^*)$ is such that $\alpha_i^* > 0$ for $i = 1, \ldots, m$. For $\Phi^{(0)}$ in $\Omega$, denote by $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ a sequence in $\Omega$ generated by the EM iteration (4.5) and (5.3). Then with probability 1, whenever $N$ is sufficiently large, the unique strongly consistent solution $\Phi^N = (\alpha_1^N, \ldots, \alpha_m^N, \phi_1^N, \ldots, \phi_m^N)$ of the likelihood equations is well-defined and there is a certain norm $\|\cdot\|$ on $\Omega$ in which $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ converges linearly to $\Phi^N$ whenever $\Phi^{(0)}$ is sufficiently near $\Phi^N$, i.e., there is a constant $\lambda$, $0 < \lambda < 1$, for which

\[ \|\Phi^{(j+1)} - \Phi^N\| \leq \lambda \|\Phi^{(j)} - \Phi^N\|, \qquad j = 0,1,2,\ldots, \tag{5.4} \]

whenever $\Phi^{(0)}$ is sufficiently near $\Phi^N$.

Proof: One sees not only that Condition 2 of Section 3 is satisfied but also, by restricting $\Omega$ to be a small neighborhood of $\Phi^*$ if necessary, that Condition 1 holds as well for the family of mixture densities under consideration. It follows from Theorem 3.1 that with probability 1, $\Phi^N$ is well-defined whenever $N$ is sufficiently large and converges to $\Phi^*$ as $N$ approaches infinity. It must be shown that with probability 1, whenever $N$ is sufficiently large, there is a norm $\|\cdot\|$ on $\Omega$ and a constant $\lambda$, $0 < \lambda < 1$, such that (5.4) holds whenever $\Phi^{(0)}$ is sufficiently near $\Phi^N$. Toward this end, we observe that the EM iteration of interest is actually a functional iteration $\Phi^+ = G(\Phi^c)$, where $G(\Phi)$ is the function defined in the obvious way by (4.5) and (5.3). Note that $G(\Phi)$ is continuously differentiable in $\Omega$ and that any $\hat\Phi$ which satisfies the likelihood equations (3.5) and (3.8) (and $\hat\Phi = \Phi^N$ in particular) is a fixed point of $G(\Phi)$, i.e., $\hat\Phi = G(\hat\Phi)$. Consequently one can write

\[ \Phi^+ - \Phi^N = G(\Phi^c) - G(\Phi^N) = G'(\Phi^N)(\Phi^c - \Phi^N) + O(\|\Phi^c - \Phi^N\|^2) \tag{5.5} \]

for any $\Phi^c$ in $\Omega$ near $\Phi^N$ and any norm $\|\cdot\|$ on $\Omega$, where $G'(\Phi^N)$ denotes the Fréchet derivative of $G(\Phi)$ evaluated at $\Phi^N$. (For questions concerning Fréchet derivatives, see, for example, Luenberger [83].) We will complete the proof by showing that with probability 1, $G'(\Phi^N)$ converges as $N$ approaches infinity to an operator which has operator norm less than 1 with respect to a certain norm on $\Omega$.

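For convenience, we introduce the following notation for $i = 1, \ldots, m$:

\[ \beta_i(x) = p_i(x \mid \phi_i^N) / p(x \mid \Phi^N), \]

\[ \sigma_i = \int_{R^n} [t_i(x) - \phi_i^N][t_i(x) - \phi_i^N]^T p_i(x \mid \phi_i^N) \, d\mu, \]

\[ \gamma_i(x) = \sigma_i^{-1} [t_i(x) - \phi_i^N]. \]

Regarding an element $\Phi \in \Omega$ as an $(m + \sum_{i=1}^m n_i)$-vector in the natural way, one can show via a very tedious calculation that $G'(\Phi^N)$ has the $(m + \sum_{i=1}^m n_i) \times (m + \sum_{i=1}^m n_i)$ matrix representation

\[ G'(\Phi^N) = \mathrm{diag}\Big(1, \ldots, 1, \frac{1}{N} \sum_{k=1}^N \beta_1(x_k) \sigma_1 \gamma_1(x_k) \gamma_1(x_k)^T, \ldots, \frac{1}{N} \sum_{k=1}^N \beta_m(x_k) \sigma_m \gamma_m(x_k) \gamma_m(x_k)^T\Big) \]
\[ \qquad - \mathrm{diag}\big(\alpha_1^N, \ldots, \alpha_m^N, (\alpha_1^N)^{-1} \sigma_1, \ldots, (\alpha_m^N)^{-1} \sigma_m\big) \Big\{ \frac{1}{N} \sum_{k=1}^N V(x_k) V(x_k)^T \Big\}, \]

where

\[ V(x) = \big(\beta_1(x), \ldots, \beta_m(x), \alpha_1^N \beta_1(x) \gamma_1(x)^T, \ldots, \alpha_m^N \beta_m(x) \gamma_m(x)^T\big)^T. \]

(Since $\Phi^N$ converges to $\Phi^*$ with probability 1, we can assume that with probability 1, each $\alpha_i^N$ is non-zero whenever $N$ is sufficiently large.) It follows from the Strong Law of Large Numbers (see Loève [82]) that with probability 1, $G'(\Phi^N)$ converges to $E[G'(\Phi^*)] = I - QR$, where

\[ Q = \mathrm{diag}\big(\alpha_1^*, \ldots, \alpha_m^*, (\alpha_1^*)^{-1} \sigma_1, \ldots, (\alpha_m^*)^{-1} \sigma_m\big) \]

and

\[ R = \int_{R^n} V(x) V(x)^T p(x \mid \Phi^*) \, d\mu. \]

It is understood that in these expressions defining $Q$ and $R$, $\Phi^N$ and its components have been replaced by $\Phi^*$ and its components.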

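It remains to be shown that there is a norm $\|\cdot\|$ on $\Omega$ with respect to which $E[G'(\Phi^*)]$ has operator norm less than 1. Now $Q$ and $R$ are positive-definite symmetric operators with respect to the Euclidean inner product, and so $QR$ is a positive-definite symmetric operator with respect to the inner product $\langle \cdot, \cdot \rangle$ defined by $\langle U, W \rangle = U^T Q^{-1} W$ for $(m + \sum_{i=1}^m n_i)$-vectors $U$ and $W$. Consequently, to prove the theorem, it suffices to show that the operator norm of $QR$ with respect to the norm defined by $\langle \cdot, \cdot \rangle$ is at most 1, since the eigenvalues of $QR$ then lie in $(0,1]$ and those of $I - QR$ in $[0,1)$.

Since $QR$ is positive-definite symmetric with respect to $\langle \cdot, \cdot \rangle$, we need only show that $\langle U, QRU \rangle \leq \langle U, U \rangle$ for an arbitrary $(m + \sum_{i=1}^m n_i)$-vector $U = (\delta_1, \ldots, \delta_m, \xi_1^T, \ldots, \xi_m^T)^T$. One has

\[ \langle U, QRU \rangle = U^T R U = \int_{R^n} \Big\{ \sum_{i=1}^m \delta_i \beta_i(x) + \sum_{i=1}^m \xi_i^T [\alpha_i^* \beta_i(x) \gamma_i(x)] \Big\}^2 p(x \mid \Phi^*) \, d\mu \]
\[ = \int_{R^n} \Big\{ \sum_{i=1}^m \Big[ \frac{\delta_i}{\alpha_i^*} + \xi_i^T \gamma_i(x) \Big] \alpha_i^* \beta_i(x) \Big\}^2 p(x \mid \Phi^*) \, d\mu \]
\[ \leq \int_{R^n} \sum_{i=1}^m \Big[ \frac{\delta_i}{\alpha_i^*} + \xi_i^T \gamma_i(x) \Big]^2 \alpha_i^* p_i(x \mid \phi_i^*) \, d\mu. \]

The inequality is a consequence of the following corollary of Schwarz's inequality: If $\eta_i \geq 0$ for $i = 1, \ldots, m$ and if $\sum_{i=1}^m \eta_i = 1$, then

\[ \Big\{ \sum_{i=1}^m \eta_i \zeta_i \Big\}^2 \leq \sum_{i=1}^m \eta_i \zeta_i^2 \]

for all $\{\zeta_i\}_{i=1,\ldots,m}$. Since

\[ \int_{R^n} \gamma_i(x) p_i(x \mid \phi_i^*) \, d\mu = 0, \]

one continues to obtain

\[ \langle U, QRU \rangle \leq \int_{R^n} \sum_{i=1}^m \Big[ \frac{\delta_i^2}{(\alpha_i^*)^2} + \xi_i^T \gamma_i(x) \gamma_i(x)^T \xi_i \Big] \alpha_i^* p_i(x \mid \phi_i^*) \, d\mu = \langle U, U \rangle. \]

This completes the proof.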

It is instructive to explore the consequences of Theorem 5.2 and the developments in its proof. One sees from the proof of Theorem 5.2 that with probability 1, for sufficiently large $N$ and $\Phi^{(0)}$ sufficiently near $\Phi^N$, an inequality (5.4) holds in which $\|\cdot\|$ is the norm determined by the inner product $\langle \cdot, \cdot \rangle$ defined in the proof and $\lambda$ is arbitrarily close to the operator norm of $E[G'(\Phi^*)] = I - QR$ determined by $\|\cdot\|$. Since $QR$ is positive-definite symmetric with respect to $\langle \cdot, \cdot \rangle$, this operator norm is just $\rho(I - QR)$, the spectral radius or largest absolute value of an eigenvalue of $I - QR$. Thus with probability 1, one can obtain a quantitative estimate of the speed of convergence to $\Phi^N$ for large $N$ of a sequence generated by the EM iteration (4.5) and (5.3) by taking $\lambda \approx \rho(I - QR)$ in (5.4).

What properties of the mixture density influence the speed of convergence to $\Phi^N$ of an EM iteration sequence for large $N$? Careful inspection shows that if the component populations in the mixture are "well separated" in the sense that

\[ \frac{p_i(x \mid \phi_i^*)}{p(x \mid \Phi^*)} \cdot \frac{p_j(x \mid \phi_j^*)}{p(x \mid \Phi^*)} \approx 0 \quad \text{for } x \in R^n, \text{ whenever } i \neq j, \]

then $QR \approx I$. It follows that $\rho(I - QR) \approx 0$, and an EM iteration sequence which converges to $\Phi^N$ exhibits rapid linear convergence. On the other hand, if the component populations in the mixture are "poorly separated" in the sense that, say, the $i$th and $j$th component populations are such that

\[ \frac{p_i(x \mid \phi_i^*)}{p(x \mid \Phi^*)} \approx \frac{p_j(x \mid \phi_j^*)}{p(x \mid \Phi^*)} \quad \text{for } x \in R^n, \]

then $R$ is nearly singular. One concludes that $\rho(I - QR) \approx 1$ in this case and that slow linear convergence of an EM iteration sequence to $\Phi^N$ can be expected.
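The effect of separation on the linear rate can be seen in the simplest special case, which is worked here as an illustration rather than as part of the original analysis: two known unit-variance normal components with only the proportion $\alpha_1$ unknown. In that case the EM map is a scalar function of $\alpha_1$, and differentiating the update (4.5) gives a derivative at the fixed point equal to the sample average of $\beta_1(x_k)\beta_2(x_k)$. The Python sketch below assumes a small deterministic, symmetric sample in place of random draws; all names in it are hypothetical.

```python
import math


def normal_pdf(x, mean):
    """Unit-variance normal density."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def em_map(alpha, p1, p2):
    """EM update (4.5) for alpha_1 when alpha_2 = 1 - alpha_1 and both components are known."""
    n = len(p1)
    return sum(alpha * a / (alpha * a + (1.0 - alpha) * b) for a, b in zip(p1, p2)) / n


def linear_rate(separation, iterations=5000):
    """Return the fixed point and two estimates of the local linear convergence rate."""
    # Deterministic, symmetric sample standing in for draws from the mixture.
    base = [-1.5, -0.5, 0.5, 1.5]
    samples = base + [x + separation for x in base]
    p1 = [normal_pdf(x, 0.0) for x in samples]
    p2 = [normal_pdf(x, separation) for x in samples]
    alpha = 0.8
    for _ in range(iterations):
        alpha = em_map(alpha, p1, p2)
    # Sample average of beta_1(x_k) * beta_2(x_k) at the fixed point.
    predicted = sum(
        a * b / (alpha * a + (1.0 - alpha) * b) ** 2 for a, b in zip(p1, p2)
    ) / len(samples)
    # Central-difference derivative of the EM map at the fixed point.
    h = 1e-6
    numerical = (em_map(alpha + h, p1, p2) - em_map(alpha - h, p1, p2)) / (2.0 * h)
    return alpha, predicted, numerical


fixed_far, rate_far, check_far = linear_rate(6.0)     # well separated means
fixed_near, rate_near, check_near = linear_rate(0.5)  # poorly separated means
```

With the means six standard deviations apart the products $\beta_1\beta_2$ are nearly zero and the rate is close to 0; with the means half a standard deviation apart the two ratios nearly coincide and the rate approaches 1, in line with the qualitative discussion above.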

In the interest of obtaining iteration sequences which converge more rapidly than EM iteration sequences, Peters and Walker [101], [102] and Redner [108] considered iterative methods which proceed at each iteration in the EM direction with a step whose length is controlled by a parameter $\epsilon$. In the present context, these methods take the form

\[ \Phi^+ = F_\epsilon(\Phi^c) = (1 - \epsilon) \Phi^c + \epsilon G(\Phi^c), \tag{5.6} \]

where $G(\Phi)$ is the EM iteration function defined by (4.5) and (5.3). The idea is to optimize the speed of convergence to $\Phi^N$ of an iteration sequence generated by such a method for large $N$ by choosing $\epsilon$ to minimize the spectral radius of $E[F_\epsilon'(\Phi^*)] = I - \epsilon QR$. As in [101], [102] and [108], one can easily show that the optimal choice of $\epsilon$ is always greater than one, lies near one if the component populations in the mixture are "well separated" in the above sense, and cannot be much smaller than two if the component populations are "poorly separated" in the above sense. The extent to which the speed of convergence of an iteration sequence can be enhanced by making the optimal choice of $\epsilon$ in (5.6) is determined by the length of the subinterval of $(0,1]$ in which the spectrum of $QR$ lies. (Greater improvements in convergence speed are realized from the optimal choice of $\epsilon$ when this subinterval is relatively narrow.) The applications of iterative procedures of the form (5.6) are at present incompletely explored and might well bear further investigation.
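The dependence of the best step length on the spectrum of $QR$ can be made explicit with a standard fact about iterations of this kind: if the eigenvalues of $QR$ are real and lie in $[\lambda_{\min}, \lambda_{\max}] \subset (0,1]$, then the maximum of $|1 - \epsilon\lambda|$ over that interval is minimized at $\epsilon = 2/(\lambda_{\min} + \lambda_{\max})$, where it equals $(\lambda_{\max} - \lambda_{\min})/(\lambda_{\max} + \lambda_{\min})$. The Python sketch below applies this formula to two hypothetical spectra; the eigenvalue lists are invented for illustration and are not taken from any mixture in the text.

```python
def spectral_radius(epsilon, eigenvalues):
    """Spectral radius of I - epsilon*QR given the (real) eigenvalues of QR."""
    return max(abs(1.0 - epsilon * lam) for lam in eigenvalues)


def optimal_relaxation(eigenvalues):
    """Step length minimizing the spectral radius of I - epsilon*QR, and that radius."""
    lo, hi = min(eigenvalues), max(eigenvalues)
    epsilon = 2.0 / (lo + hi)
    rho = (hi - lo) / (hi + lo)
    return epsilon, rho


# Hypothetical spectra of QR inside (0, 1].
well_separated = [0.9, 0.95, 1.0]
poorly_separated = [0.01, 0.5, 1.0]

eps_well, rho_well = optimal_relaxation(well_separated)
eps_poor, rho_poor = optimal_relaxation(poorly_separated)
rho_em_well = spectral_radius(1.0, well_separated)    # plain EM, epsilon = 1
rho_em_poor = spectral_radius(1.0, poorly_separated)  # plain EM, epsilon = 1
```

In the first case the optimal step is only slightly larger than one; in the second it is close to two, and the gain over plain EM is small because the spectrum fills nearly all of $(0,1]$.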

We conclude this section by briefly reviewing some of the special forms which the EM iteration takes when a particular component density $p_i(x \mid \phi_i)$ is a member of one of the common exponential families. We also comment on some convergence properties of sequences $\{\phi_i^{(j)}\}_{j=0,1,2,\ldots}$ generated by the EM algorithm in the examples considered. Hopefully, our comments will prove helpful in determining convergence properties of EM iteration sequences $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ through the use of Theorem 5.1 or other means when all component densities are from one or more of these example families.

Example 1: Poisson density. In this example, n = 1 and a
natural choice of $\Omega_i$ is $\Omega_i = \{\phi_i \in R^1 : 0 < \phi_i < \infty\}$. For
$\phi_i \in \Omega_i$, one has

$$p_i(x \mid \phi_i) = \frac{\phi_i^x}{x!}\, e^{-\phi_i}, \qquad x = 0, 1, 2, \ldots;$$

and the EM iteration (5.3) for a sample of Type 1 becomes

$$\phi_i^+ = \left\{ \sum_{k=1}^N x_k \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} \right\} \Bigg/ \left\{ \sum_{k=1}^N \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} \right\}.$$

Note that $\phi_i^+$ is always contained in the convex hull of
$\{x_k\}_{k=1,\ldots,N}$, a compact subset of $\bar{\Omega}_i$. Therefore, the set
of limit points of an EM iteration sequence $\{\phi_i^{(j)}\}_{j=0,1,2,\ldots}$
is a nonempty compact subset of $\bar{\Omega}_i$.
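As an illustration of Example 1, the following is a minimal sketch of one EM iteration for a Poisson mixture, combining the mean update above with the usual proportion update of (4.5). The function names and the sample of counts are hypothetical; only the standard library is used.

```python
import math


def poisson_pmf(x, mean):
    """Poisson probability of the count x for a component with the given mean."""
    return math.exp(x * math.log(mean) - mean - math.lgamma(x + 1))


def em_step_poisson(xs, alphas, means):
    """One EM iteration for an unlabeled (Type 1) sample xs from a Poisson mixture."""
    m = len(alphas)
    weight_sums = [0.0] * m   # sum over k of the posterior weights
    weighted_x = [0.0] * m    # sum over k of x_k times the posterior weight
    for x in xs:
        joint = [alphas[i] * poisson_pmf(x, means[i]) for i in range(m)]
        total = sum(joint)    # mixture density p(x_k | Phi^c)
        for i in range(m):
            w = joint[i] / total
            weight_sums[i] += w
            weighted_x[i] += w * x
    new_alphas = [s / len(xs) for s in weight_sums]
    new_means = [weighted_x[i] / weight_sums[i] for i in range(m)]
    return new_alphas, new_means


xs = [0, 1, 1, 2, 3, 7, 8, 9, 10, 12]   # made-up counts
alphas, means = [0.5, 0.5], [1.0, 8.0]  # made-up starting values
for _ in range(20):
    alphas, means = em_step_poisson(xs, alphas, means)
print(alphas, means)
```

Each updated mean is a weighted average of the observed counts, so it stays inside the interval spanned by the data, which is the convex-hull property noted above.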

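Example 2: Binomial density. Here, n = 1 and one naturally
chooses $\Omega_i$ to be the open set $\{\phi_i \in R^1 : 0 < \phi_i < 1\}$. For
$\phi_i \in \Omega_i$, $p_i(x \mid \phi_i)$ is given by

$$p_i(x \mid \phi_i) = \binom{\nu_i}{x} \phi_i^{x} (1-\phi_i)^{\nu_i - x}, \qquad x = 0, 1, \ldots, \nu_i,$$

for a prescribed integer $\nu_i$. In this case, the EM iteration
(5.3) for a sample of Type 1 becomes

$$\phi_i^+ = \left\{ \sum_{k=1}^N \frac{x_k}{\nu_i} \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} \right\} \Bigg/ \left\{ \sum_{k=1}^N \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} \right\}.$$

Since $p_i(x \mid \phi_i)$ is non-zero only if $x = 0, 1, \ldots, \nu_i$, one sees
from this expression that the set of limit points of an EM
iteration sequence $\{\phi_i^{(j)}\}_{j=0,1,2,\ldots}$ is a nonempty compact
subset of $\{\phi_i \in R^1 : 0 \le \phi_i \le 1\}$.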


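Example 3: Exponential density. Again, n = 1 and one takes
$\Omega_i = \{\phi_i \in R^1 : 0 < \phi_i < \infty\}$. For $\phi_i \in \Omega_i$, one has

$$p_i(x \mid \phi_i) = \frac{1}{\phi_i}\, e^{-x/\phi_i}, \qquad 0 < x < \infty.$$

The EM iteration (5.3) for a sample of Type 1 now becomes

$$\phi_i^+ = \left\{ \sum_{k=1}^N x_k \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} \right\} \Bigg/ \left\{ \sum_{k=1}^N \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} \right\},$$

and one sees that the set of limit points of an EM iteration
sequence $\{\phi_i^{(j)}\}_{j=0,1,2,\ldots}$ is a nonempty compact subset of $\Omega_i$.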


Example 4: Multivariate normal density. In this example, n is
an arbitrary positive integer; and $\phi_i$ is most conveniently
represented as $\phi_i = (\mu_i, \Sigma_i)$, where $\mu_i \in R^n$ and $\Sigma_i$ is a
positive-definite symmetric n x n matrix. (Of course, this
representation of $\phi_i$ is not the usual representation of the
"expectation" parameter.) Then $\Omega_i$ is the set of all such $\phi_i$,
and $p_i(x \mid \phi_i)$ is given by (4.7). For a sample of Type 1, the EM
iteration (5.3) becomes that of (4.8) and (4.9).

One can see from (4.9) that each $\Sigma_i^+$ is in the convex hull
of $\{(x_k - \mu_i^+)(x_k - \mu_i^+)^T\}_{k=1,\ldots,N}$, a set of rank-one matrices
which, of course, are not positive definite. Thus there is no
guarantee that a sequence of matrices $\{\Sigma_i^{(j)}\}_{j=0,1,2,\ldots}$ produced
by the EM iteration will remain bounded from below. Indeed, it
has been observed in practice that sequences of iterates produced
by the EM algorithm for a mixture of multivariate normal
densities do occasionally converge to "singular solutions" (cf.
Duda and Hart [44]), i.e., points on the boundary of $\Omega_i$ with
associated singular matrices.


It was observed by Hosmer [68] that if enough labeled
observations are included in a sample on a mixture of normal
densities, then with probability 1, the log-likelihood function
attains its maximum value at a point at which the covariance
matrices are positive definite. Similarly, consideration of
samples with a sufficiently large number of labeled observations
alleviates with probability 1 the problem of an EM iteration
sequence having "singular solutions" as limit points. For
example, if one considers a sample $S = S_1 \cup S_3$ which is a
stochastically independent union of a sample $S_1 = \{x_k\}_{k=1,\ldots,N}$
of Type 1 and a sample $S_3 = \bigcup_{i=1}^m \{z_{ik}\}_{k=1,\ldots,K_i}$ of Type 3, then
the EM iteration becomes


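$$\alpha_i^+ = \frac{1}{N+K} \left\{ \sum_{k=1}^N \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} + K_i \right\},$$

$$\mu_i^+ = \left\{ \sum_{k=1}^N x_k \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} + \sum_{k=1}^{K_i} z_{ik} \right\} \Bigg/ \left\{ \sum_{k=1}^N \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} + K_i \right\},$$

$$\Sigma_i^+ = \left\{ \sum_{k=1}^N (x_k - \mu_i^+)(x_k - \mu_i^+)^T \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} + \sum_{k=1}^{K_i} (z_{ik} - \mu_i^+)(z_{ik} - \mu_i^+)^T \right\} \Bigg/ \left\{ \sum_{k=1}^N \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} + K_i \right\},$$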

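where $K = \sum_{i=1}^m K_i$. One sees from the expression for $\Sigma_i^+$ that
$\Sigma_i^+$ is bounded below by

$$\left\{ \sum_{k=1}^{K_i} (z_{ik} - \mu_i^+)(z_{ik} - \mu_i^+)^T \right\} \Bigg/ \left\{ \sum_{k=1}^N \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} + K_i \right\},$$

which is in turn bounded below by

$$\frac{1}{N + K_i} \left\{ \sum_{k=1}^{K_i} (z_{ik} - \bar{z}_i)(z_{ik} - \bar{z}_i)^T \right\},$$

where

$$\bar{z}_i = \frac{1}{K_i} \sum_{k=1}^{K_i} z_{ik}.$$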

Now this last matrix is positive-definite with probability 1
whenever $K_i > n$. Consequently, if $K_i > n$, then with
probability 1, the elements of a sequence $\{\Sigma_i^{(j)}\}_{j=0,1,2,\ldots}$
produced by the EM algorithm are bounded below by a positive
definite matrix; hence, such a sequence cannot have singular
matrices as limit points.
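The partly labeled iteration and its lower bound can be checked numerically in the simplest setting. The sketch below assumes univariate normal components (n = 1), so that each covariance matrix reduces to a scalar variance; the function names and the data are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_step_partly_labeled(xs, labeled, alphas, means, variances):
    """One EM iteration; labeled[i] holds the observations known to come from component i."""
    m, N = len(alphas), len(xs)
    K = sum(len(z) for z in labeled)
    # Posterior weights alpha_i p_i(x_k) / p(x_k) for the unlabeled observations.
    w = []
    for x in xs:
        joint = [alphas[i] * normal_pdf(x, means[i], variances[i]) for i in range(m)]
        total = sum(joint)
        w.append([j / total for j in joint])
    new_alphas, new_means, new_vars = [], [], []
    for i in range(m):
        z, Ki = labeled[i], len(labeled[i])
        denom = sum(w[k][i] for k in range(N)) + Ki
        mu = (sum(w[k][i] * xs[k] for k in range(N)) + sum(z)) / denom
        var = (sum(w[k][i] * (xs[k] - mu) ** 2 for k in range(N))
               + sum((zk - mu) ** 2 for zk in z)) / denom
        new_alphas.append(denom / (N + K))
        new_means.append(mu)
        new_vars.append(var)
    return new_alphas, new_means, new_vars


def variance_floor(z, N):
    """Lower bound (1/(N+K_i)) * sum_k (z_ik - zbar_i)^2 on the updated variance."""
    zbar = sum(z) / len(z)
    return sum((zk - zbar) ** 2 for zk in z) / (N + len(z))


xs = [-2.3, -1.1, -0.4, 0.2, 0.9, 1.6, 2.8]       # made-up unlabeled data
labeled = [[-1.5, -0.8, -2.0], [1.2, 2.1, 0.7]]   # made-up labeled data
alphas, means, variances = em_step_partly_labeled(
    xs, labeled, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
floors = [variance_floor(z, len(xs)) for z in labeled]
print(variances, floors)
```

With at least two distinct labeled observations per component, each floor is strictly positive, so the updated variances cannot collapse to zero in this univariate case.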


6. Performance of the EM Algorithm

In this concluding section, we review and summarize features
of the EM algorithm having to do with its effectiveness in
practice on mixture density estimation problems. As always, it
is understood that a parametric family of mixture densities of
the form (1.1) is of interest and that a particular
$\Phi^* = (\alpha_1^*, \ldots, \alpha_m^*, \phi_1^*, \ldots, \phi_m^*)$ is the "true" parameter value to be
estimated.

In order to provide some perspective, we begin by offering a
brief description of the most basic forms of several alternative
methods for numerically approximating maximum-likelihood
estimates. In describing these methods, it is assumed for
convenience that the sample at hand is a sample
$S_1 = \{x_k\}_{k=1,\ldots,N}$ of Type 1 described in Section 3 and that one
can write $\Phi$ as a vector $\Phi = (\xi_1, \ldots, \xi_\nu)^T$ of unconstrained
scalar parameters at points of interest in $\Omega$. Each of the
methods to be described seeks a maximum-likelihood estimate by
attempting to determine a point $\hat{\Phi}$ such that

$$\nabla_\Phi L_1(\hat{\Phi}) = 0, \tag{6.1}$$

where $L_1(\Phi)$ is the log-likelihood function given by (3.1). The
features of the methods which concern us here are their speed of
convergence, the computation and storage required for their
implementation, and the extent to which their basic forms need to
be modified in order to make them effective and trustworthy in
practice.


The first of the alternative methods to be described is
Newton's method. It is the method on which the other methods
reviewed here are modeled, and it is given as follows: Given a
current approximation $\Phi^c$ of a solution of (6.1), determine a
next approximation $\Phi^+$ by

$$\Phi^+ = \Phi^c - H(\Phi^c)^{-1} \nabla_\Phi L_1(\Phi^c). \tag{6.2}$$

The function $H(\Phi)$ in (6.2) is the Hessian matrix of $L_1(\Phi)$
given by (3.10).

Under reasonable assumptions on $L_1(\Phi)$, one can show that a
sequence of iterates $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ produced by Newton's
method enjoys quadratic local convergence to a solution $\hat{\Phi}$ of
(6.1) (see, for example, Ortega and Rheinboldt [95]). This is to
say that given a norm $\|\cdot\|$ on $\Omega$, there is a constant $\beta$ such
that if $\Phi^{(0)}$ is sufficiently near $\hat{\Phi}$, then an inequality

$$\|\Phi^{(j+1)} - \hat{\Phi}\| \le \beta \|\Phi^{(j)} - \hat{\Phi}\|^2 \tag{6.3}$$

holds for j = 0,1,2,... . Quadratic convergence is ultimately
very fast, and it is regarded as the major strength of Newton's
method. Unfortunately, there are aspects of Newton's method
which are associated with potentially severe problems in some
applications. For one thing, Newton's method requires at each
iteration the computation of the $\nu \times \nu$ Hessian matrix and the
solution of a system of $\nu$ linear equations (at a cost of $O(\nu^3)$
arithmetic operations in general) with this Hessian as the
coefficient matrix; thus the computation required for an
iteration of Newton's method is likely to become expensive very
rapidly as m, n, and N grow large. (It should also be
mentioned that one must allow for the storage of the Hessian or
some set of factors of it.) For another thing, Newton's method
in its basic form (6.2) requires for some problems an
impractically accurate initial approximate solution $\Phi^{(0)}$ in
order for a sequence of iterates $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ to converge
to a solution of (6.1). Consequently, in order to be regarded as
an algorithm which is safe and effective on applications of
interest, the basic form (6.2) is likely to require augmentation
with some procedure for enhancing the global convergence behavior
of sequences of iterates produced by it. Such a procedure should
be designed to insure that a sequence of iterates not only
converges but also does not converge to a solution of (6.1) which
is not a local maximum of $L_1(\Phi)$.

A broad class of methods which are based on Newton's method
are quasi-Newton methods of the general form

$$\Phi^+ = \Phi^c - B^{-1} \nabla_\Phi L_1(\Phi^c), \tag{6.4}$$

in which B is regarded as an approximation of $H(\Phi^c)$. Methods
of the form (6.4) which are particularly successful are those in
which the approximation $B \approx H(\Phi^c)$ is maintained by doing a
secant update of B at each iteration (see Dennis and Moré [40]
or Dennis and Schnabel [41]). In the applications of interest
here, such updates are typically realized as rank one or (more
likely) rank two changes in B. Methods employing such updates
have the advantages over Newton's method of not requiring the
evaluation of the Hessian matrix at each iteration and of being
implementable in ways which require only $O(\nu^2)$ arithmetic
operations to solve the system of $\nu$ linear equations at each
iteration. The price paid for these advantages is that the full
quadratic convergence of Newton's method is lost; rather, under
reasonable assumptions on $L_1(\Phi)$, a sequence of iterates
$\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ produced by one of these methods can only be
shown to exhibit local superlinear convergence to a solution $\hat{\Phi}$
of (6.1), i.e., one can only show that if a norm $\|\cdot\|$ on $\Omega$ is
given and if $\Phi^{(0)}$ is sufficiently near $\hat{\Phi}$ (and an initial
approximate Hessian is sufficiently near $H(\hat{\Phi})$), then
there exists a sequence $\{\beta_j\}_{j=0,1,2,\ldots}$ which converges to zero
and is such that

$$\|\Phi^{(j+1)} - \hat{\Phi}\| \le \beta_j \|\Phi^{(j)} - \hat{\Phi}\|$$

for j = 0,1,2,... . Like Newton's method, methods of the
general form (6.4), including those employing secant updates, are
likely to require augmentation with safeguards to enhance global
convergence properties and to insure that iterates do not
converge to solutions of (6.1) which are not local maxima of
$L_1(\Phi)$.


Finally, we describe a particular method of the form (6.4)
which is specifically formulated for solving likelihood
equations. This is the method of scoring, mentioned earlier in
connection with the work of Rao [107] and reviewed in a general
setting by Kale [74], [75]. (Kale [74], [75] also discusses
modifications of Newton's method and the method of scoring in
which the Hessian matrix or an approximation of it is held fixed
for some number of iterations in the hope of reducing overall
computational effort.) In the method of scoring, one ideally
chooses B in (6.4) to be

$$B = -N I(\Phi^c), \tag{6.5}$$

where $I(\Phi)$ is the Fisher information matrix given by (2.5.1).
Since the computation of $I(\Phi^c)$ is likely to be prohibitively
expensive for most mixture density problems, a more appealing
choice of B than (6.5) might be the sample approximation

$$B = -\sum_{k=1}^N [\nabla_\Phi \log p(x_k \mid \Phi^c)][\nabla_\Phi \log p(x_k \mid \Phi^c)]^T. \tag{6.6}$$

The choice (6.6) can be justified in the following manner: The
Hessian $H(\Phi)$ is given by

$$H(\Phi) = -\sum_{k=1}^N [\nabla_\Phi \log p(x_k \mid \Phi)][\nabla_\Phi \log p(x_k \mid \Phi)]^T + \sum_{k=1}^N \frac{1}{p(x_k \mid \Phi)} \nabla_\Phi^2\, p(x_k \mid \Phi). \tag{6.7}$$

Now the second sum in (6.7) has zero expectation at $\Phi = \Phi^*$;
furthermore, since the terms $\nabla_\Phi \log p(x_k \mid \Phi)$ must be computed
in order to obtain $\nabla_\Phi L_1(\Phi)$, the first sum in (6.7) is available at
the cost of only $O(N\nu^2)$ arithmetic operations while determining
the second sum is likely to involve a great deal more expense.
Thus (6.6) is a choice of B which is readily available at
relatively low cost and which is likely to constitute a major
part of $H(\Phi^c)$ when N is large and $\Phi^c$ is near $\Phi^*$. It is
clear from this discussion that the method of scoring with B
given by (6.6) is an analogue for general maximum-likelihood
estimation of the Gauss-Newton method for nonlinear least-squares
problems (see Ortega and Rheinboldt [95]). If the computation of
$I(\Phi^c)$ is not too expensive, then the choice of B given by
(6.5) can be justified in much the same way.
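A minimal sketch of one scoring step with the sample approximation (6.6) is given below. It makes strong simplifying assumptions that are not part of the text: a two-component univariate normal mixture in which the proportions and the common variance are known, so that only the two means are unknown and B is a 2 x 2 matrix. All names and data are hypothetical, and no global convergence safeguard is included.

```python
import math


def score_vectors(xs, alphas, means, var):
    """Gradients of log p(x_k | Phi) with respect to the two component means."""
    grads = []
    for x in xs:
        # The common normalizing constant cancels in the posterior weights.
        joint = [a * math.exp(-0.5 * (x - mu) ** 2 / var)
                 for a, mu in zip(alphas, means)]
        total = sum(joint)
        # d/d mu_i log p(x) = w_i(x) * (x - mu_i) / var, with w_i the posterior weight.
        grads.append([joint[i] / total * (x - means[i]) / var for i in range(2)])
    return grads


def scoring_step(xs, alphas, means, var):
    """One step Phi+ = Phi - B^{-1} grad L_1, with B = -sum_k g_k g_k^T."""
    g = score_vectors(xs, alphas, means, var)
    grad = [sum(gk[i] for gk in g) for i in range(2)]
    B = [[-sum(gk[i] * gk[j] for gk in g) for j in range(2)] for i in range(2)]
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    # Solve the 2 x 2 system B * step = grad by Cramer's rule.
    step = [(grad[0] * B[1][1] - B[0][1] * grad[1]) / det,
            (B[0][0] * grad[1] - B[1][0] * grad[0]) / det]
    return [means[i] - step[i] for i in range(2)], B


xs = [-2.1, -1.4, -0.3, 0.2, 1.7, 2.5, 3.0]   # made-up data
new_means, B = scoring_step(xs, [0.5, 0.5], [-1.0, 2.0], 1.0)
print(new_means, B)
```

Because B is built from outer products of the score vectors that are needed anyway for the gradient, no second derivatives are evaluated, which is the cost advantage described above.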

The method of scoring in its basic form requires $O(N\nu^2)$
arithmetic operations to evaluate B given by (6.6) and $O(\nu^3)$
arithmetic operations to solve the system of $\nu$ linear equations
implicit in (6.4). Since these $O(N\nu^2)$ arithmetic operations
are likely to be considerably less expensive than the evaluation
of the full Hessian given by (6.7), the cost of computation per
iteration of the method of scoring lies between that of a
quasi-Newton method employing a low-rank secant update and that of
Newton's method. Under reasonable assumptions on $L_1(\Phi)$, one
can show that with probability 1, if a solution $\hat{\Phi}$ of (6.1) is
sufficiently near $\Phi^*$ and if N is sufficiently large, then a
sequence of iterates $\{\Phi^{(j)}\}_{j=0,1,2,\ldots}$ generated by the method
of scoring with B given by either (6.5) or (6.6) exhibits local
linear convergence to $\hat{\Phi}$, i.e., there is a norm $\|\cdot\|$ on $\Omega$ and
a constant $\lambda$, $0 < \lambda < 1$, for which

$$\|\Phi^{(j+1)} - \hat{\Phi}\| \le \lambda \|\Phi^{(j)} - \hat{\Phi}\|, \qquad j = 0, 1, 2, \ldots, \tag{6.8}$$

whenever $\Phi^{(0)}$ is sufficiently near $\hat{\Phi}$. If $\hat{\Phi}$ is very near $\Phi^*$
and if N is very large, then this convergence should be fast,
i.e., (6.8) should hold for a small constant $\lambda$. Like Newton's
method and all methods of the general form (6.4), the method of
scoring is likely to require augmentation with global convergence
safeguards in order to be considered trustworthy and effective.

Having  reviewed  the  above  alternative  methods,  we  return  now 
to  the  EM  algorithm  and  summarize  its  attractive  features.  Its 
most  appealing  general  property  is  that  it  produces  sequences  of 
iterates on which the log-likelihood function increases
monotonically. This monotonicity is the basis of the general
convergence theorems of Section 4, and these theorems reinforce a
large  body  of  empirical  evidence  to  the  effect  that  the  EM 
algorithm  does  not  require  augmentation  with  elaborate  safeguards 
such  as  those  necessary  for  Newton's  method  and  quasi-Newton 
methods  in  order  to  produce  iteration  sequences  with  good  global 
convergence  characteristics. 

More can be said about the EM algorithm for mixtures of
densities from exponential families under the assumption that
$\phi_1, \ldots, \phi_m$ are mutually independent variables. One sees from
(4.5) and (5.3) and similar expressions for samples of types
other than Type 1 that it is unlikely that any other algorithm
would be nearly as easy to encode on a computer or would require
as little storage. In view of (4.5) and (5.3), it also seems
that any constraints on $\Phi$ are likely to be satisfied, or at
least nearly satisfied for large samples. For example, it is
clear from (4.9) that each $\Sigma_i^+$ generated by the EM algorithm for
a mixture of multivariate normal densities is symmetric and, with
probability 1, positive-definite whenever N > n. Certainly the
mixing proportions generated by (4.5) are always non-negative and
sum to 1. It is also apparent from (4.5) and (5.3) that the
computational cost of each iteration of the EM algorithm is low
compared to that of the alternative methods reviewed above. In
the case of a mixture of multivariate normal densities, for
example, the EM algorithm requires $O(mn^2N)$ arithmetic
operations per iteration, compared to at least
$[O_1(m^2n^4N) + O_2(m^3n^6)]$ for Newton's method and the method of
scoring and $[O_1(mn^2N) + O_2(m^2n^4)]$ for a quasi-Newton method
employing a low-rank secant update. (All of these methods
require the same number of exponential function evaluations per
iteration.) Arithmetic per iteration for the three latter
methods can, of course, be reduced by retaining a fixed
approximate Hessian for some number of iterations at the risk of
increasing the total number of iterations.

In  spite  of  these  attractive  features,  the  EM  algorithm  can 
encounter  problems  in  practice.  The  source  of  the  most  serious 
practical  problems  associated  with  the  algorithm  is  the  speed  of 
convergence of sequences of iterates generated by it, which can
often be annoyingly or even hopelessly slow. In the case of
mixtures of densities from exponential families, Theorem 5.2
suggests that one can expect the convergence of EM iteration
sequences to be linear, as opposed to the (very fast) quadratic
convergence associated with Newton's method, the (fast)
superlinear convergence associated with a quasi-Newton method
employing  a low-rank  secant  update,  and  the  (perhaps  fast)  linear 
convergence  of  the  method  of  scoring.  The  discussion  following 
Theorem  5.2  suggests  further  that  the  speed  of  this  linear 
convergence  depends  in  a certain  sense  on  the  separation  of  the 
component  populations  in  the  mixture.  To  demonstrate  the  speed 
of  this  linear  convergence  and  its  dependence  on  the  separation 
of  the  component  populations,  we  again  consider  the  example  of  a 
mixture  of  two  univariate  normal  densities  (see  (1.3)  and  (1.4)). 


Table 6.1 below summarizes the results of a numerical
experiment involving a mixture of two univariate normal densities
for the choices of $\Phi^*$ appearing in Table 3.3. (These choices
were obtained as before by taking $\alpha_1^* = .3$, $\sigma_1^{2*} = \sigma_2^{2*} = 1$, and
varying the mean separation $\mu_1^* - \mu_2^*$. For convenience, we took
$\mu_2^* = -\mu_1^*$.) In this experiment, a Type 1 sample of 1000
observations on the mixture was generated for each choice of
$\Phi^*$; and a sequence of iterates was produced by the EM algorithm
(see (4.5), (4.8), and (4.9)) from starting values
$\alpha_1^{(0)} = \alpha_2^{(0)} = .5$, $\mu_1^{(0)} = 1.5\mu_1^*$, $\mu_2^{(0)} = 1.5\mu_2^*$, and
$\sigma_1^{2(0)} = \sigma_2^{2(0)} = .5$. An accurate determination of the limit of
the sequence was made in each case, and observations were made of
the iteration numbers at which various degrees of accuracy were
first obtained. These iteration numbers are recorded in Table
6.1 beneath the corresponding degrees of accuracy; in the table,
"E" denotes the largest absolute value of the components of the
difference between the indicated iterate and the limit. In
addition, the spectral radius of the derivative of the EM
iteration function at the limit was calculated in each case (cf.
Theorem 5.2 and the following discussion). These spectral radii,
appearing in the column headed by "$\rho$" in Table 6.1, provide
quantitative estimates of the factors by which errors are reduced
from one iteration to the next in each case. Finally, to give an
idea of the point in an iteration sequence at which numerical
error first begins to affect the theoretical performance of the
algorithm, we observed in each case the iteration numbers at
which loss of monotonicity of the log-likelihood function first
occurred; these iteration numbers appear in Table 6.1 in the
column headed by "LM".
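The experiment just described can be sketched for a single choice of the mean separation. The code below is an illustration only: it uses Python's standard random number generator rather than the routines used for Table 6.1, a hypothetical seed, a separation of 3.0, and a fixed budget of 50 iterations, so its output is not expected to reproduce the table entries.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, alphas, means, variances):
    return sum(
        math.log(sum(a * normal_pdf(x, mu, v)
                     for a, mu, v in zip(alphas, means, variances)))
        for x in xs
    )


def em_step(xs, alphas, means, variances):
    """One EM iteration for an unlabeled sample from a two-component normal mixture."""
    weights = []
    for x in xs:
        joint = [a * normal_pdf(x, mu, v)
                 for a, mu, v in zip(alphas, means, variances)]
        total = sum(joint)
        weights.append([j / total for j in joint])
    new_alphas, new_means, new_vars = [], [], []
    for i in range(2):
        wsum = sum(w[i] for w in weights)
        mu = sum(w[i] * x for w, x in zip(weights, xs)) / wsum
        var = sum(w[i] * (x - mu) ** 2 for w, x in zip(weights, xs)) / wsum
        new_alphas.append(wsum / len(xs))
        new_means.append(mu)
        new_vars.append(var)
    return new_alphas, new_means, new_vars


# True parameters: alpha_1 = .3, unit variances, mu_2 = -mu_1 (separation 3.0 here).
rng = random.Random(0)
true_means = [1.5, -1.5]
xs = [rng.gauss(true_means[0] if rng.random() < 0.3 else true_means[1], 1.0)
      for _ in range(1000)]

# Starting values as described in the text.
alphas, means, variances = [0.5, 0.5], [1.5 * mu for mu in true_means], [0.5, 0.5]
history = [log_likelihood(xs, alphas, means, variances)]
for _ in range(50):
    alphas, means, variances = em_step(xs, alphas, means, variances)
    history.append(log_likelihood(xs, alphas, means, variances))
print(alphas, means, variances)
```

The list `history` records the log-likelihood after each iteration, so the monotone increase discussed earlier in this section can be inspected directly.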

In preparing Table 6.1, all computing was done in double
precision on an IBM 3032 (see footnote 3). Eigenvalues were calculated with
EISPACK subroutines TRED1 and TQL1, and normally distributed data
was obtained by transforming uniformly distributed data generated
by the subroutine URAND of Forsythe, Malcolm, and Moler [46]
based on suggestions of Knuth [79].

3. We are grateful to the Mathematics and Statistics Department
of the University of New Mexico for providing the computing
support for the generation of this table.

 mu1*-mu2*  E<10^-1  E<10^-2  E<10^-3  E<10^-4  E<10^-5  E<10^-6  E<10^-7  E<10^-8     LM    rho
    0.2       2078     2334     2528     2717     2906     3095     3283     3472    3056   .9879
    0.5        710      852      985     1117     1249     1381     1513     1643    1361   .9827
    1.0        349      442      526      610      693      777      861      949     779   .9728
    1.5        280      414      537      660      783      906     1028     1151     887   .9814
    2.0        126      281      432      582      732      883     1033     1183     846   .9849
    3.0          2       31       62       93      124      155      185      216     173   .9280
    4.0          1        6       16       25       35       44       54       63      55   .7864
    6.0          1        1        2        3        4        5        7        8       8   .2143

Table 6.1: Results of applying the EM algorithm to a problem involving a Type 1
sample on a mixture of two univariate normal densities with
$\alpha_1^* = .3$, $\sigma_1^{2*} = \sigma_2^{2*} = 1$.


A number of comments about the contents of Table 6.1 are in
order. First, it is clear from the table that an exorbitantly
large number of EM iterations may be required to obtain a very
accurate numerical approximation of the maximum-likelihood
estimate if the sample is from a mixture of poorly separated
component populations. However, in such a case, one sees from
Table 3.3 that the variance of the estimate is likely to be such
that it may be pointless to seek very much accuracy in a
numerical approximation. Second, we remark on the pleasing
consistency between the computed values of the spectral radius of
the derivative of the EM iteration function and the differences
between the iteration numbers needed to obtain varying degrees of
accuracy. What we have in mind is the following: If the errors
among the members of a linearly convergent sequence are reduced
more or less by a factor of $\rho$, $0 < \rho < 1$, from one iteration
to the next, then the number of iterations $\Delta k$ necessary to
obtain an additional decimal digit of accuracy is given
approximately by $\Delta k \approx -\log 10/\log \rho$. This relationship between
$\Delta k$ and $\rho$ is borne out very well in Table 6.1. This fact
strongly suggests that after a number of EM iterations have been
made, the errors in the iterates lie almost entirely in the
eigenspace corresponding to the dominant eigenvalue of the
derivative of the EM iteration function. We take this as
evidence that one might very profitably apply simple
relaxation-type acceleration procedures such as those of Peters and Walker
[101], [102] and Redner [108] to sequences of iterates generated
by the EM algorithm.
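The relationship between the number of iterations per decimal digit and the spectral radius is easy to check against the table. The sketch below uses the rows of Table 6.1 for mean separations 2.0, 3.0, and 6.0; the function name is hypothetical.

```python
import math


def iterations_per_digit(rho):
    """Approximate iterations needed to gain one decimal digit when errors shrink by rho."""
    return -math.log(10.0) / math.log(rho)


# Spectral radii and iteration counts (E < 10^-1, ..., E < 10^-8) as listed in Table 6.1.
rows = {
    2.0: (0.9849, [126, 281, 432, 582, 732, 883, 1033, 1183]),
    3.0: (0.9280, [2, 31, 62, 93, 124, 155, 185, 216]),
    6.0: (0.2143, [1, 1, 2, 3, 4, 5, 7, 8]),
}
for separation, (rho, counts) in rows.items():
    # Observed numbers of iterations between successive accuracy levels.
    gaps = [b - a for a, b in zip(counts, counts[1:])]
    print(separation, round(iterations_per_digit(rho), 1), gaps)
```

For a separation of 2.0 the formula gives roughly 151 iterations per digit, against observed gaps of 150 to 155; for 3.0 it gives roughly 31, against observed gaps of 29 to 31.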

Third, in all of the cases listed in Table 6.1 except one,
we observed that over 95 percent of the change in the
log-likelihood function between the starting point and the limit of
the EM iteration sequence was realized after only five
iterations, regardless of the number of iterations ultimately
required to approximate the limit very closely. (The exceptional
case is that in which $\mu_1^* - \mu_2^* = 1.0$; in that case, about 83
percent of the change in the log-likelihood function was observed
after five iterations.) This suggests to us that even when the
component populations in a mixture are poorly separated, the EM
algorithm can be expected to produce in a very small number of
iterations parameter values such that the mixture density
determined by them reflects the sample data very well. Fourth,
it is evident from Table 6.1 that elements of an EM iteration
sequence continue to make steady progress toward the limit even
after numerical error has begun to interfere with the theoretical
properties of the algorithm.

Fifth, the apparently anomalous decrease in $\rho$ occurring
when $\mu_1^* - \mu_2^*$ decreases from 2.0 to 1.0 happened concurrently
with the iteration sequence limit of the proportion of the first
population in the mixture becoming very small. (Such very small
limit proportions continued to be observed in the cases
$\mu_1^* - \mu_2^* = 0.5, 0.2$.) We do not know whether this decrease in the
limit proportion of the first population indicates a sudden
movement of the maximum-likelihood estimate as $\mu_1^* - \mu_2^*$ drops
below 2.0 or whether the iteration sequence limit is something
other than the maximum-likelihood estimate in the cases in which
$\mu_1^* - \mu_2^*$ is less than 2.0. Finally, we also conducted more
than 60 trials similar to those reported on in Table 6.1 except
with samples of 200 rather than 1000 generated observations on
the mixture. The results were comparable to those given in Table
6.1. It should be mentioned, however, that the EM iteration
sequences obtained using samples of 200 observations did
occasionally converge to "singular solutions," i.e., limits
associated with zero component variances. Convergence to such
"singular solutions" did not occur among the relatively small
number of trials involving samples of 1000 observations.

At present, the EM algorithm is being widely applied not
only to mixture density estimation problems but also to a wide
variety of other problems. We would like to conclude
this survey with a little speculation about the future of the
algorithm. It seems likely that the EM algorithm in its basic
form will find a secure niche as an algorithm useful in
situations in which some resources are limited. For example, the
limited time which an experimenter can afford to spend writing
programs coupled with a lack of available library software for
safely and efficiently implementing competing methods could make
the simplicity and reliability of the EM algorithm very
appealing. Also, the EM algorithm might be very well suited for
use on small computers for which limitations on program and data
storage are more stringent than limitations on computing time.

Although meaningful comparison tests have not yet been made,
it seems doubtful to us that the unadorned EM algorithm can be
competitive as a general tool with well-designed general
optimization algorithms such as those implemented in good
currently-available software library routines. Our doubt is
based on the intolerably slow convergence of sequences of
iterates generated by the EM algorithm in some applications. On
the other hand, it is entirely possible that the EM algorithm
could be modified to incorporate procedures for accelerating
convergence and that such modification would enhance its
competitiveness. It is also possible that an effective hybrid
algorithm might be constructed which first takes advantage of the
good global convergence properties of the EM algorithm by using
it initially and then exploits the rapid local convergence of
Newton's method or one of its variants by switching to such a
method later. Our feeling is that time might well be spent on
research addressing these possibilities.

ABSTRACT

"Kriging" is the name of a parametric regression method
used by hydrologists and mining engineers, among others. Features
of the kriging approach are that it also provides an error estimate
and that it can conveniently be employed also to estimate the
integral of the regression function.

In the present work, we describe the kriging method and explore
some of its statistical characteristics. Also, some extensions
of the Watson method and theory are made so that it too displays
the kriging features. Theoretical and computational comparisons
of the kriging and Watson approaches are offered.

Regression Methods for Spatial Data*

by  S.  Yakowitz  and  F.  Szidarovszky 
Systems  & Industrial  Engineering  Department 
University  of  Arizona 


1. Background and Scope of this Study

Specialists in hydrology, mining, petroleum engineering, and
other geoscience-based subjects have recently exhibited considerable
interest and enthusiasm for a methodology known as "kriging". To
name only a few recent (mostly water-resource oriented) works, we
mention in this regard, Bakr et al (1978), Chirlin and Dagan
(1980), David (1977), Delhomme (1978, 1979), Dendrou and Houstis
(1978), Gambolati and Giampero (1979), Gambolati and Volpi (1979),
Gelhar et al (1979), Huijbregts (1978), Journel (1974, 1977),
Journel and Huijbregts (1978), and Villeneuve et al (1979). The
name "Kriging" derives, according to Journel (1977), from Krige (1951),
where the basic idea was first outlined. Matheron (1963) should
be credited with its early dissemination. In the present section,
we will carefully examine the statistical problems which the
Kriging method is intended to solve, and in Section II, we will
reveal the popular kriging algorithms themselves and derive their
properties. It turns out that there are certain unsatisfactory
aspects to the current kriging techniques, and yet prior to the
present study, they appear to be the only methods appropriate for
the problems in their domain. However, methods of nonparametric
regression are certainly somewhat relevant. In Section III, we
have provided an extension of nonparametric regression theory to
increase its relevance to kriging problems.

* Some of the results of this work have been announced in Szidarovszky
and Yakowitz (1981), and also in a presentation at the SIAM Conference
on Deep Mineral Exploration, Tucson, October, 1981.
Let $f(x)$ and $n(x)$ be uncorrelated real-valued functions defined
on a domain $X$ in $\mathbb{R}^m$. Suppose $\{(x_i, y_i)\}_{i=1}^N$ is a sequence of
"noisy" function pairs; that is, suppose

$$y_i = f(x_i) + n(x_i), \quad 1 \le i \le N. \tag{1.1}$$

The interpretation is that $f(x)$ is a function whose values are to
be estimated, and $n(x)$ represents the noise incurred if a
measurement is taken at position $x$. We discuss below two problems
which are central to the kriging literature:

Problem 1: Let $x^* \in X$ be specified. It may or may not be among the
sample pairs. On the basis of the sample pairs $\{(x_i, y_i)\}_{i=1}^N$,

(a) Provide an estimate $f_N(x^*)$ of $f(x^*)$, and

(b) Provide an estimate of the expected squared error

$$E[(f_N(x^*) - f(x^*))^2 \mid x_1, \ldots, x_N]. \tag{1.2}$$

Remarks. The goal of part (a) coincides with the objectives of
"nonparametric regression" methods, but to our knowledge, investigators
in this latter area have not concerned themselves with task
(b). Because practitioners desire to estimate piezometric head
in oil and water aquifers or the grade of an ore body as a function
of position, the dimension $m$ of the domain $X$ is often 2 or 3.

Problem 2. Let $\{(x_i, y_i)\}_{i=1}^N$ be as above and let $D$ be a subregion
of domain $X$.

(a) Estimate the integral $\int_D f(x)\,dx$, and

(b) Provide a formula for the (sample-dependent) expected square
error of this estimate.




Remark.  An  application  motivating  Problem  2 is  that  of  estimating 
the  total  weight  of  metal  which  can  be  extracted  from  the  ore  body 
occupying  volume  D,  given  imperfect  assay  estimates  of  the  grade 
at  distinct  locations. 

Problems 1 and 2 seem to have their roots in the forestry and
geostatistics literature. In fact, it seems that "geostatistics" is
almost synonymous with kriging. We have no doubt that problems
1 and 2 are important and interesting. In this connection, in a
review of a geostatistics book, Watson (1977) has written, "The
time is certainly ripe for a more serious attack on the estimation
of the earth's resources,..."

2. Introduction to Kriging Methodology

In the kriging approach, it is presumed that $f(x)$ and $n(x)$ in
(1.1) are realizations of stochastic processes uncorrelated from
one another with finite second moments. It is further assumed that
$f(x)$ is a realization of an intrinsic random function (IRF); that is,
for some functions $\{\phi_j(x)\}_{j=1}^J$ known to the user and perhaps
unknown constants $a_1, \ldots, a_J$, for all $x$, $h$ such that $x, x+h \in X$,

$$E[f(x+h) - f(x)] = \sum_{j=1}^{J} a_j \,(\phi_j(x+h) - \phi_j(x)) \tag{2.1}$$

and, independently of $x$, with "var" signifying "variance",

$$\tfrac{1}{2} \operatorname{var}[f(x+h) - f(x)] = \gamma(h). \tag{2.2}$$

The constants $\{a_j, 1 \le j \le J\}$ and the function $\gamma(h)$ are quantities
which must be inferred from the data $\{(x_i, y_i)\}_{i=1}^N$. In what follows,
it is presumed always that $J < N$. The function $\gamma(h)$ is called the
variogram. Even in the case in which the mean $E[f(x)]$ is known
to be constant in $x$ (i.e., $J = 1$, $\phi_1 = 1$), the hypothesis of
"intrinsic random function" is weaker than second-order stationarity.
For example, Brownian motion is an intrinsic random function,
but it is well-known to be a nonstationary process.
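The Brownian motion remark can be made concrete with a short worked equation. This is an illustration added here, assuming a standard one-dimensional Brownian motion $B$ on $[0, \infty)$ with $B(0) = 0$ (the symbol $B$ is not used elsewhere in the text):

```latex
\operatorname{var}[B(x+h) - B(x)] = |h|
\quad\Longrightarrow\quad
\gamma(h) = \tfrac{1}{2}\,|h| ,
\qquad\text{whereas}\qquad
\operatorname{var}[B(x)] = x .
```

The increment variance depends on $h$ alone, so (2.1) and (2.2) hold with a constant mean, while the variance of the process itself grows with $x$, which rules out second-order stationarity.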

The  kriging  method  is  composed  of  two  activities,  (i)  Inferring 
the  variogram  from  the  data,  and  (ii)  Assuming  that  the  inferred 
variogram  is  indeed  exact,  providing  a best  linear  unbiased 
estimator  and  associated  error  variance,  as  required  by  Problem 
1 or  Problem  2. 

Activity (ii) is a standard least-squares problem, and is
consequently by far the best understood of the two facets of
kriging. There are some inconsistencies in the fundamental
definitions and results in the kriging literature. For example,
the definitions of "intrinsic random function" given by David (1977)
and Matheron (1971) do not coincide. The discussions of noise
and the "nugget effect" have likewise been inconsistent. The
equations for kriging in the presence of noise as given by Rendu (1980),
for example, agree with our calculations, but differ from formulas
offered by other authors (e.g., Journel (1978)). In view of these
inconsistencies, we have elected to derive the "universal kriging"
equations for prediction with known variogram from first principles.
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Linear  estimation  from  known  variograms 

To begin with, suppose the noise term in (1.1) is zero. Let
us assume that the variogram $\gamma(h)$ and the mean function components
$\{\phi_j(x)\}_{j=1}^J$ of the expectation (2.1) are given. The assumption that
one of these functions, say $\phi_1$, is 1, seems to be a universal
and perhaps unavoidable assumption which we also will adopt. To
begin with, let us discuss the solution of Problem 1. The objective
is to choose the parameters $\{\lambda_i\}_{i=1}^N$ so that the linear estimator

$$f_N(x^*) = \lambda_1 y_1 + \lambda_2 y_2 + \cdots + \lambda_N y_N \tag{2.3}$$

minimizes

$$E[(f(x^*) - f_N(x^*))^2], \tag{2.4a}$$

subject to

$$E[f_N(x^*)] = E[f(x^*)]. \tag{2.4b}$$



In  view  of  the  assumed  form  (2.1)  of  the  mean  value  function,  a 
sufficient  (but  not  necessary)  condition  for  the  unbiasedness 
equation  (2.4b)  to  hold  is  that 

$$\sum_{i=1}^{N} \lambda_i \phi_j(x_i) = \phi_j(x^*), \quad 1 \le j \le J. \tag{2.5}$$

Equation (2.5) with $\phi_1 = 1$ implies that

$$\sum_{i=1}^{N} \lambda_i = 1. \tag{2.6}$$

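Use this fact, with the unbiasedness of the estimator $f_N(x^*)$ of
$f(x^*) = f^*$, to conclude that, with "cov" signifying "covariance",

$$E\Big[\Big(f^* - \sum_{i=1}^{N} \lambda_i y_i\Big)^2\Big]
  = \operatorname{var}\Big(f^* - \sum_{i=1}^{N} \lambda_i y_i\Big)
  = \operatorname{var}\Big(\sum_{i} \lambda_i (f^* - y_i)\Big)
  = \sum_{i} \sum_{j} \lambda_i \lambda_j \operatorname{cov}[(f^* - y_i), (f^* - y_j)]. \tag{2.7}$$

Now observe that

$$\operatorname{cov}[(f^* - y_i), (f^* - y_j)]
  = \tfrac{1}{2}\big[-\operatorname{var}((f^* - y_i) - (f^* - y_j)) + \operatorname{var}(f^* - y_i) + \operatorname{var}(f^* - y_j)\big]
  = -\gamma(x_i - x_j) + \gamma(x^* - x_i) + \gamma(x^* - x_j). \tag{2.8}$$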


One makes these substitutions into (2.7) and, after some algebra,
sees that the Lagrange multiplier technique for minimizing
$E[(f(x^*) - f_N(x^*))^2]$ subject to (2.5) yields

$$\sum_{k=1}^{N} \lambda_k \gamma(x_i - x_k) = \gamma(x_i - x^*) + \sum_{j=1}^{J} \mu_j \phi_j(x_i), \quad 1 \le i \le N, \tag{2.9a}$$

$$\sum_{i=1}^{N} \lambda_i \phi_j(x_i) = \phi_j(x^*), \quad 1 \le j \le J. \tag{2.9b}$$

The variables $\mu_j$ are the Lagrange multipliers. Journel (1977)
calls the above linear equation the universal kriging system.

From substitution according to (2.9) into (2.7), one concludes
that the mean squared prediction error is given by

$$E[(f^* - f_N(x^*))^2] = \sum_{i=1}^{N} \lambda_i \gamma(x^* - x_i) - \sum_{j=1}^{J} \mu_j \phi_j(x^*). \tag{2.10}$$
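The system (2.9a,b) and the error formula (2.10) can be sketched in a few lines of code. The following is a minimal illustration rather than the authors' implementation: the function names `solve` and `krige`, the one-dimensional sample locations, and the choice of the variogram $\gamma(h) = |h|$ are all assumptions made for the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # pivot row
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def krige(xs, ys, x_star, gamma, drift=(lambda x: 1.0,)):
    """Noise-free universal kriging at x_star: returns (estimate, mse, weights).

    The unknowns are lambda_1..lambda_N followed by mu_1..mu_J, as in (2.9a,b).
    """
    n, j = len(xs), len(drift)
    a = [[0.0] * (n + j) for _ in range(n + j)]
    b = [0.0] * (n + j)
    for i in range(n):
        for k in range(n):
            a[i][k] = gamma(xs[i] - xs[k])
        for q in range(j):
            a[i][n + q] = -drift[q](xs[i])  # mu terms of (2.9a) moved to the left
            a[n + q][i] = drift[q](xs[i])   # unbiasedness constraints (2.9b)
        b[i] = gamma(xs[i] - x_star)
    for q in range(j):
        b[n + q] = drift[q](x_star)
    sol = solve(a, b)
    lam, mu = sol[:n], sol[n:]
    estimate = sum(l * y for l, y in zip(lam, ys))
    # Mean squared prediction error, formula (2.10)
    mse = (sum(l * gamma(x_star - x) for l, x in zip(lam, xs))
           - sum(u * d(x_star) for u, d in zip(mu, drift)))
    return estimate, mse, lam


linear = lambda h: abs(h)  # assumed variogram for the demonstration
est, mse, lam = krige([0.0, 1.0, 3.0], [1.0, 2.0, 5.0], 0.5, linear)
print(est, mse, lam)
```

For these inputs the weights come out as approximately (0.5, 0.5, 0), that is, linear interpolation between the two neighbours of $x^*$, with a predicted mean squared error of about 0.5. Passing `drift=(lambda x: 1.0, lambda x: x)` adds a linear drift term; the constraints (2.9b) then force exact reproduction of data that are exactly linear in $x$.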

If the noise term $n(x)$ in (1.1) has zero mean, one accounts
for its presence by noting that, because it is presumed uncorrelated
from the $f$-process,

$$\operatorname{cov}((f^* - y_i), (f^* - y_j))
  = \operatorname{cov}((f^* - f_i - n_i), (f^* - f_j - n_j))
  = \operatorname{cov}((f^* - f_i), (f^* - f_j)) + \operatorname{cov}(n_i, n_j).$$

In the above equation, we have, of course, intended that $n_i$ signify
$n(x_i)$. As a result of the above, one readily sees that in the
presence of noise (2.9a) should be replaced by (2.9'a):


$$\sum_{k=1}^{N} \lambda_k \big(\gamma(x_i - x_k) - \operatorname{cov}(n(x_i), n(x_k))\big)
  = \gamma(x_i - x^*) + \sum_{j=1}^{J} \mu_j \phi_j(x_i), \quad 1 \le i \le N. \tag{2.9'a}$$

Let us now investigate the modifications necessary for solution
of Problem 2 described in Section 1. Assume $\int_D dx = 1$. In this case, we
replace the objective (2.4) by the task of minimizing

$$E\Big[\Big(\int_D f(x)\,dx - \sum_{i=1}^{N} \lambda_i y_i\Big)^2\Big] \tag{2.11a}$$

subject to

$$E\Big[\sum_{i=1}^{N} \lambda_i y_i\Big] = E\Big[\int_D f(x)\,dx\Big]. \tag{2.11b}$$

The preceding kriging analysis leads, in the integral estimation
case, to the following universal kriging system:

$$\sum_{k=1}^{N} \lambda_k \gamma(x_i - x_k) = \int_D \gamma(x_i - x)\,dx + \sum_{j=1}^{J} \mu_j \phi_j(x_i), \quad 1 \le i \le N,$$
$$\sum_{i=1}^{N} \lambda_i \phi_j(x_i) = \int_D \phi_j(x)\,dx, \quad 1 \le j \le J. \tag{2.12}$$

The expected squared error of the integral estimate is given by

$$E\Big[\Big(\int_D f(x)\,dx - \sum_{i=1}^{N} \lambda_i y_i\Big)^2\Big]
  = \sum_{i=1}^{N} \lambda_i \int_D \gamma(x_i - x)\,dx
  - \sum_{j=1}^{J} \mu_j \int_D \phi_j(x)\,dx
  - \int_D\!\int_D \gamma(x - x')\,dx\,dx'. \tag{2.13}$$
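The integral system (2.12) and error formula (2.13) can be tried out on a case where every integral has a closed form. The sketch below is an illustration under stated assumptions, not the authors' code: $D = [0, 1]$, $\gamma(h) = |h|$, constant mean ($J = 1$, $\phi_1 = 1$), and all sample points inside $D$. The names `solve` and `integral_krige` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def integral_krige(xs, ys):
    """Kriging estimate of the integral of f over D = [0, 1] with gamma(h) = |h|.

    Returns (estimate, mse, weights). Assumes every x_i lies in [0, 1].
    """
    n = len(xs)
    # Closed form of the integral over [0, 1] of |x_i - x| dx
    gbar = [(x * x + (1.0 - x) ** 2) / 2.0 for x in xs]
    a = [[abs(xi - xk) for xk in xs] + [-1.0] for xi in xs]  # rows of (2.12), mu on the left
    a.append([1.0] * n + [0.0])  # sum of weights = integral of phi_1 over D = 1
    sol = solve(a, gbar + [1.0])
    lam, mu = sol[:n], sol[n]
    estimate = sum(l * y for l, y in zip(lam, ys))
    # (2.13); the double integral of |x - x'| over [0, 1]^2 equals 1/3
    mse = sum(l * g for l, g in zip(lam, gbar)) - mu - 1.0 / 3.0
    return estimate, mse, lam


est, mse, lam = integral_krige([0.0, 0.5, 1.0], [1.0, 3.0, 2.0])
print(est, mse, lam)
```

With samples at 0, 0.5 and 1, the weights come out as (0.25, 0.5, 0.25), which is the trapezoidal rule, and the predicted squared error is 1/24. For this variogram the point predictor is the piecewise-linear interpolant of the data, so the trapezoidal weights are exactly the integral of the point predictor over $D$.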


Inference  of  the  variogram 


The task of inferring a covariance function or power spectral
density from data is known by experienced statisticians to be somewhat
delicate, and one which furthermore requires a considerable
quantity of data. The subtleties of the covariance inference
problem translate directly to the task of inferring a variogram
from data.

There are some very real difficulties with variogram estimation
in the published kriging applications. To avoid effects of
"nonstationarity", practitioners tend to have a single variogram apply
only to a relatively small region $X$ of domain points of $f(x)$. Moreover
they have not developed procedures to ascertain whether the
intrinsic random function hypothesis is tenable for their applications.
A particular difficulty is that in the bounded domain case,
ergodic theorems are inapplicable to the task of demonstrating
consistency. To our knowledge, with the exception of certain extreme
cases such as white noise, no methods for inferring the covariance
function from sample pairs $\{(x_i, f(x_i))\}$, $f(\cdot)$ a fixed sample function,
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are  known  to  be  consistent. 

We now concern ourselves with outlining the present practice
with regard to variogram inference. The recommended procedure
is to choose a parametric family of variograms from the five or six popular
families mentioned in the literature, and then to select the
variogram from the chosen family which agrees best, in some sense,
with the covariance function constructed from the data $\{(x_i, y_i)\}_{i=1}^N$.
We list in Table 2.1 some of the prominent variogram families.

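Monomial       $\gamma_\theta(h) = \omega\,|h|^{a}$

Spherical      $\gamma_\theta(h) = \omega\,[\tfrac{3}{2}(|h|/a) - \tfrac{1}{2}(|h|/a)^3]$ for $|h| \le a$;
               $\gamma_\theta(h) = \omega$ for $|h| > a$

Exponential    $\gamma_\theta(h) = \omega\,[1 - \exp(-|h|/a)]$

Gaussian       $\gamma_\theta(h) = \omega\,[1 - \exp(-|h|^2/a^2)]$

where $\theta = (a, \omega)$

Table 2.1

A Listing of Popular
Variogram Families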
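The four families can be written as short functions. This sketch assumes the usual textbook forms of the monomial, spherical, exponential and Gaussian variograms, with `a` playing the role of the exponent (monomial) or range parameter (other families) and `w` the scale $\omega$; the function names are illustrative.

```python
import math


def monomial(h, a, w):
    """gamma(h) = w * |h|**a (the exponent range 0 < a < 2 is assumed)."""
    return w * abs(h) ** a


def spherical(h, a, w):
    """Rises as w * (1.5 r - 0.5 r**3) with r = |h| / a, then stays at the sill w."""
    r = abs(h) / a
    return w * (1.5 * r - 0.5 * r ** 3) if r <= 1.0 else w


def exponential(h, a, w):
    """gamma(h) = w * (1 - exp(-|h| / a)); approaches the sill w asymptotically."""
    return w * (1.0 - math.exp(-abs(h) / a))


def gaussian(h, a, w):
    """gamma(h) = w * (1 - exp(-h**2 / a**2)); smooth (parabolic) near the origin."""
    return w * (1.0 - math.exp(-(h * h) / (a * a)))


for family in (monomial, spherical, exponential, gaussian):
    print(family.__name__, [round(family(h, 1.5, 2.0), 4) for h in (0.0, 0.5, 1.0, 2.0)])
```

All four vanish at $h = 0$. The spherical, exponential and Gaussian families are bounded by $\omega$, while the monomial family is unbounded.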


There seems to be no consensus in the literature on methodology
for the selection of a parametric family from Table 2.1 on the basis
of an observed sample $\{(x_i, y_i)\}$. Some heuristic approaches are
proposed by David (1977). Concerning the task of selection of the
member $\gamma_\theta(h)$, the foremost criteria seem to be (i) least squares,
(ii) cross validation, and a geometric procedure (David (1977)).
In the least squares approach, one selects
the parameter $\theta^*$ so as to minimize

$$V(\theta) = \sum_{v} \big(\gamma_\theta(h_v) - \gamma_N(h_v)\big)^2,$$

the index $v$ running over some finite collection of arguments $h_v$
and $\gamma_N(h)$ being some sample approximation to the variogram, such as

$$\gamma_N(h) = \frac{1}{2N(h)} \sum_{j=1}^{N(h)} \big(y_j - y_{j(h)}\big)^2,$$

where $j(h)$ is an index selected so that $x_j - x_{j(h)} = h$ and $N(h)$ is the
number of such points selected.
If "drift" is thought to be present (that is, if $a_j$, $j > 1$, in (2.1) is not zero),
then this approach entails some serious conceptual difficulties.
Matheron (1971, Chapter 4) has addressed these difficulties.
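The sample variogram and the least squares criterion can be sketched as follows. This is an illustration for scalar sample locations only; the names `empirical_variogram` and `least_squares_criterion`, the exact-lag matching with tolerance `tol`, and the toy data are assumptions made for the example.

```python
def empirical_variogram(xs, ys, h, tol=1e-9):
    """Half the mean squared difference over all pairs with x_j - x_i = h (within tol)."""
    sq = [(ys[j] - ys[i]) ** 2
          for i in range(len(xs)) for j in range(len(xs))
          if abs((xs[j] - xs[i]) - h) <= tol]
    if not sq:
        raise ValueError("no sample pairs at this lag")
    return sum(sq) / (2.0 * len(sq))


def least_squares_criterion(model, lags, gamma_hat):
    """Sum over the chosen lags of (model(h_v) - gamma_hat(h_v))**2."""
    return sum((model(h) - g) ** 2 for h, g in zip(lags, gamma_hat))


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
lags = [1.0, 2.0]
gamma_hat = [empirical_variogram(xs, ys, h) for h in lags]
print(gamma_hat)

# Compare a few candidate scale parameters w for the model gamma(h) = w * |h|
for w in (0.5, 1.0, 1.5):
    print(w, least_squares_criterion(lambda h: w * abs(h), lags, gamma_hat))
```

In practice the lags would be binned rather than matched exactly, and the minimization over $\theta$ would be done with a numerical optimizer; the loop over three candidate values is only meant to show what is being compared.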

The cross-validation approach to parameter selection is as follows.
Let $P(x_j, \theta)$ be the universal kriging estimate of $f(x_j)$ on the basis
of the sample points $\{(x_i, y_i)\}_{i \ne j}$ and parametric variogram $\gamma_\theta(h)$.
One then chooses $\theta^*$ to minimize the squared error of the predicted
values, which is

$$L(\theta) = \sum_{j=1}^{N} \big(y_j - P(x_j, \theta)\big)^2.$$

Practitioners  insist,  quite  rightly,  that  one  should  not  select 
a variogram  entirely  algorithmically,  but  with  attention  also  to 
past  experience  with  similar  geological  data. 
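The cross-validation criterion can be sketched as a leave-one-out loop around a kriging solver. The code below is an illustration under assumptions of our own: no drift ($J = 1$, $\phi_1 = 1$), noise-free kriging equations, scalar locations, and hypothetical function names.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ordinary_krige(xs, ys, x_star, gamma):
    """Kriging estimate at x_star with constant mean (weights sum to one)."""
    n = len(xs)
    a = [[gamma(xi - xk) for xk in xs] + [-1.0] for xi in xs]
    a.append([1.0] * n + [0.0])
    b = [gamma(xi - x_star) for xi in xs] + [1.0]
    lam = solve(a, b)[:n]
    return sum(l * y for l, y in zip(lam, ys))


def cross_validation_score(xs, ys, gamma):
    """Sum of squared leave-one-out prediction errors for the variogram gamma."""
    total = 0.0
    for j in range(len(xs)):
        rest_x = xs[:j] + xs[j + 1:]
        rest_y = ys[:j] + ys[j + 1:]
        total += (ys[j] - ordinary_krige(rest_x, rest_y, xs[j], gamma)) ** 2
    return total


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]
print(cross_validation_score(xs, ys, lambda h: abs(h)))
print(cross_validation_score(xs, ys, lambda h: 2.0 * abs(h)))
```

Both calls give a score of about 28: multiplying the variogram by a constant leaves the kriging weights, and hence the score, unchanged. Cross-validation of point predictions alone therefore carries no information about the scale parameter $\omega$ of the families in Table 2.1.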
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Convergence  and  Consistency 

With the exception of studies by Matheron (1971, 1973), the
literature of kriging tends to be practical and pragmatic. Major
issues of consistency and convergence rates have not been addressed.
In the developments to follow, we attempt to obtain initial results
in these areas.

As has been noted earlier, there is no consistent variogram
estimator based on observations $\{(x_i, f(x_i))\}_{i=1}^N$ for $x_i$ in a
bounded domain $X$ and $f$ a fixed sample of an intrinsic random function.
In  short,  the  variogram  cannot  be  consistently  inferred,  even  if 
it  is  known  to  be  a member  of  a given  family  such  as  listed  in 
Table  2.1.  On  the  other  hand,  as  we  will  later  demonstrate,  under 
certain  circumstances,  the  kriging  estimate  will  converge,  with 
increasing  number  of  samples,  to  the  correct  value,  even  when  the 
variogram  is  not  correct.  An  interpretation  of  these  remarks  is 
that  the  kriging  method  can  be  effective  for  estimating  values  on 
the  basis  of  noisy  samples,  but  that  the  associated  error  estimate 
need not be consistent. This interpretation is borne out by our
simulation  studies.  The  fact  that  the  estimate  of  the  squared  error 
need  not  become  more  accurate  with  increasing  data  is  significant 
because  kriging  practitioners  and  their  clients  place  great  value 
on  the  error  estimation  feature. 

Let us begin our analysis of convergence of the kriging estimate
under the simplest of conditions by assuming that

i) The observations are noiseless ($n(x_i) = 0$).

ii) $\gamma(0) = 0$, and $\gamma$ is continuous in a neighborhood
of the origin.

iii) There is no "drift"; that is, $J = 1$, and $\phi_1 = 1$.

iv) The "true" variogram is known.
Theorem 2.1. Let $X$ be the domain of the intrinsic random function
$f(x)$ and assume the conditions above are in force. If the infinite
sequence $\{x_i\}$ is dense in $X$, then for any $x^* \in X$ and for $f_N(x^*)$ as in
(2.3),

$$E[(f(x^*) - f_N(x^*))^2] \to 0 \quad \text{as } N \to \infty. \tag{2.14}$$

Proof. In view of assumption (iii), for every $i$, $y_i = f(x_i)$ is itself
an unbiased linear estimator of $f(x^*)$, and so for $N \ge i$,

$$E[(f(x^*) - f_N(x^*))^2] \le E[(f(x^*) - f(x_i))^2] = 2\gamma(x^* - x_i).$$

Let $x^*(N)$ denote the member of $\{x_i\}_{i=1}^N$ which is closest to $x^*$.
By the assumption that $\{x_i\}$ is dense, $x^*(N) \to x^*$ as $N \to \infty$ and
therefore

$$E[(f(x^*) - f_N(x^*))^2] \le E[(f(x^*) - f(x^*(N)))^2] = 2\gamma(x^* - x^*(N)). \tag{2.15}$$

The proposition follows by observing that, in light of property (ii),
$\gamma(x^* - x^*(N))$ must converge to 0. The bound given by (2.15) may be
of some practical interest in itself.
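As a small numerical check of the nearest-neighbor bound, consider an example of our own choosing: samples at $x = 0, 1, 3$, prediction point $x^* = 0.5$, variogram $\gamma(h) = |h|$, and no drift. The kriging system is solved by weights $(\tfrac12, \tfrac12, 0)$ with zero Lagrange multiplier, so (2.10) gives

```latex
E[(f(x^*) - f_N(x^*))^2]
  = \tfrac12\,\gamma(0.5) + \tfrac12\,\gamma(0.5) = 0.5
  \;\le\; 2\,\gamma(x^* - x^*(N)) = 2\,\gamma(0.5) = 1 .
```

The best linear unbiased estimator halves the error of the nearest-neighbor estimate here, and both quantities shrink to zero as samples accumulate near $x^*$.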

The Brownian motion process affords an example of a situation
in which the best estimate is not consistent unless $x^*$ is an accumulation
point of the sample points $\{x_i\}$. For Brownian motion is Markov,
and the best estimate of $f(x^*)$ will depend only on the points
$(x_l, f(x_l))$ and $(x_u, f(x_u))$, where $x_l$ is the largest domain sample
less than $x^*$ and $x_u$ the smallest sample greater than $x^*$.

There are many common situations in which the hypothesis
that $\{x_i\}$ is dense in $X$ will be satisfied. One important case is
that in which the $x_i$'s are selected independently from $X$ according
to a measure that assigns positive probability to every open set
(such as when the probability density function exists and is positive).
Corollary. Assume that the hypotheses of Theorem 2.1 are
in force and additionally that $X$ is open and has finite Lebesgue
measure, $\gamma(h)$ has a continuous second derivative, and the samples
$\{x_i\}$ are independently and identically distributed on $X$ with pdf
bounded away from 0 in a neighborhood of $x^*$. Then for
some fixed constant $C$ and all $N$,

$$E[(f(x^*) - f_N(x^*))^2] \le C / N^{2/m}, \tag{2.16}$$

$m$ being the dimension of the space containing $X$.

Proof. Since $\gamma(h)$ is an even function, its first derivative or
gradient at 0 must be 0, and we have

$$\gamma(x^* - x^*(N)) = \tfrac{1}{2}\,(x^* - x^*(N))^T \gamma^{(2)}(0)\,(x^* - x^*(N))
  + o(\|x^* - x^*(N)\|^2). \tag{2.17}$$

It is known (e.g., Yakowitz et al (1978), p. 1299) that under the
independent, uniformly distributed sample case, for all points
$x^* \in X$ and some constant $C_1$,

$$E[\|x^* - x^*(N)\|^2] \le C_1 / N^{2/m}, \quad N = 1, 2, \ldots. \tag{2.18}$$

From the argument in that reference, one can conclude that (2.18)
holds whenever the pdf is bounded away from 0 in a neighborhood
of $x^*$. The Corollary now follows from (2.15), (2.17) and (2.18).

From our experience in groundwater analysis, where the
domain points correspond to well locations, the hypotheses of
the corollary are of some use. On the other hand, for some ore
sampling strategies, it may be more reasonable to assume that the
$x_i$'s form a grid of similar-sized rectangles. For such regular
patterns, one may conclude that (2.18) is true without expectations,
and hence the conclusion of the Corollary remains valid.


We will now discuss convergence of the kriging estimate when
accounting for drift. Assume that $x^*$ is a limit point of $\{x_i\}$.
Assume furthermore that for some subsequence $x_{n_1}, \ldots, x_{n_J}$ the matrix
$\Phi = \{\phi_i(x_{n_j})\}_{i,j=1}^{J}$ is nonsingular. (Otherwise, there is no hope of
being able to obtain estimates satisfying (2.5) for arbitrary
$\phi_j(x^*)$ values.) For $N > n_J$, define the linear estimate

$$\tilde f_N(x^*) = (1 - \alpha_N)\, f(x^*(N)) + \alpha_N \sum_{i=1}^{J} \lambda_i^N f(x_{n_i}), \tag{2.19}$$

where $x^*(N)$ is, as before, the nearest neighbor (among the first
$N$ samples) to $x^*$, and

$$\alpha_N = \|x^* - x^*(N)\|. \tag{2.20}$$

In order to assure that the constraint condition (2.5) holds,
we set $\phi(x) = (\phi_1(x), \ldots, \phi_J(x))^T$ and determine
$\lambda^N = (\lambda_1^N, \ldots, \lambda_J^N)^T$ by

$$\alpha_N \Phi \lambda^N = \phi(x^*) - (1 - \alpha_N)\, \phi(x^*(N)). \tag{2.21}$$

The consistency of the estimate $\tilde f_N(x^*)$ will follow if only we can
show that the sequence $\{\lambda^N\}$ remains in a bounded region. Toward
that end, note that after taking a Taylor's series expansion of
$\phi(x^*) - \phi(x^*(N))$ and dividing by $\alpha_N$, we may rewrite (2.21) as

$$\Phi \lambda^N = \phi(x^*) - (1/\alpha_N)\, \nabla\phi\, (x^*(N) - x^*)
  + (1/\alpha_N)\, o(\|x^* - x^*(N)\|), \tag{2.22}$$

where the matrix $\nabla\phi$ is

$$\nabla\phi = \begin{pmatrix} \nabla\phi_1(x^*) \\ \vdots \\ \nabla\phi_J(x^*) \end{pmatrix}. \tag{2.23}$$

From (2.22), we see that $\lambda^N$ remains bounded when $\alpha_N$ is chosen
according to (2.20). In fact,

$$\|\lambda^N\| \le \|\Phi^{-1}\|\,\big[\|\nabla\phi\| + \|\phi(x^*)\|\big] + o(1).$$

We have demonstrated above that the constrained linear predictor
$\tilde f_N(x^*)$ converges to $f(x^*)$. By an earlier argument, this, in turn,
implies that $f_N(x^*)$, the best linear unbiased estimator, must likewise
converge. The interested reader can apply the convergence analysis of the
preceding discussion to achieve convergence bounds for estimation with
drift by using structures in the proof of the preceding Corollary.

Our attention now turns to the case that noise $n(x_i)$ is present
in the observation law (1.1). For simplicity, assume that $J = 1$, and
$\phi_1 = 1$. If $n(\cdot)$ is a continuous function, then apparently consistent
identification of $f(x^*)$ is not possible since local samples cannot
distinguish between the effects of signal and noise. However, the
linear estimate provided by the universal kriging equations is an
appropriate procedure and in fact coincides with what is known to
communication engineers as a "smoothing filter". If $\{n(x_i)\}$
are independent variables, then, as we now demonstrate, under some
circumstances, consistent estimation of $f(x^*)$ is possible. Toward
verifying this assertion, as in earlier arguments, we find a linear
estimator whose properties are understood, and then appeal to the
fact that since the kriging estimate is optimal in the least
squares sense, it must be at least as good as the estimator under
consideration.


For the particular task at hand of verifying consistency
in the presence of independent noise, it is sufficient to
call attention to the fact that Stone (1977) has discussed a general
class of nonparametric regression (NPR) formulas of the form

$$\hat f_N(x^*) = \sum_{i=1}^{N} y_i\, w_{i,N}(x^*; x_1, \ldots, x_N).$$

The weights $w_{i,N}$ can be taken to add to 1 (i.e., $\sum_{i=1}^{N} w_{i,N} = 1$), so
the unbiasedness condition (2.5), with $J = 1$, holds.
His results imply that if $x^*$ and $x_i$ are i.i.d. observations, and
if $f(\cdot)$ is measurable, and provided the weight functions $w_{i,N}(\cdot)$
satisfy certain natural properties, then $\hat f_N(x^*) \to f(x^*)$ in the mean.
Toward applying Stone's results to the issue of consistency
of kriging estimates in the noisy observation case, let $f(\cdot)$ denote
a realization of the intrinsic random function $f(\cdot)$. Then if
$y_i = f(x_i) + n(x_i)$, the sequence $\{(x_i, y_i)\}$ constitutes i.i.d.
observations and the hypotheses of Stone's convergence result
hold, provided $f(\cdot)$ is so much as measurable and a few technical
assumptions of little practical concern hold. So we conclude that

$$E[(\hat f_N(x^*) - f(x^*))^2] \to 0, \quad \text{as } N \to \infty.$$

It may be concluded that if the noise measurements and the sample
functions $f$ are uniformly bounded, then convergence occurs without
the conditioning on sample function $f$; alternatively, without the
boundedness assumption, one can assert that convergence in the mean
is assured outside an $f$-set of any positive measure. From results in
the next section, it may be seen that if one is willing to assume that
the sample functions are twice-continuously differentiable, then
convergence in the mean is on the entire $f$-space without the set
qualification. Convergence in the mean of the linear estimate
$\hat f_N$ implies, of course, mean convergence of the kriging estimate.
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For  certain  specific  NPR  estimates,  rates  of  convergence  are 
known  (e.g.,  Fisher  and  Yakowitz  (1976),  Parthasarathy  and  Bhattacharya 
(1961), Sacks and Spiegelman (1980), Schuster and Yakowitz (1979)).

The strongest results related to convergence of point NPR estimates
known to us are that of Schuster (1972) for one-dimensional $x_i$'s,
and for $m$-dimensional $x_i$'s, the result to be demonstrated in the
next section: for $m_N(x^*)$ the Watson NPR estimate for $f(x^*)$,
and with some provisos to be specified in Section 3,

$$E[(m_N(x^*) - f(x^*))^2] = O\big(N^{-1/(m/4 + 1)}\big). \tag{2.24}$$

This  convergence  rate  will  be  seen  to  be  optimal,  in  a certain  sense. 

In evaluating the convergence statements concerning kriging
up to this point, it should be emphasized that they are valid
only if $f(\cdot)$ really is an intrinsic random function and the
variogram and drift functions are known perfectly.

Our next discussion of kriging convergence is directed at
Problem 2 of Section 1, i.e., the integral estimation problem.
For Problem 2, as has been observed earlier, one must modify the
universal kriging equation development by replacing $f^*$ in (2.7)
by $\int_D f(x)\,dx$. The effect of this substitution is that $\gamma(x_i - x^*)$
and $\phi_j(x^*)$ are replaced by $\int_D \gamma(x_i - x)\,dx$ and $\int_D \phi_j(x)\,dx$ in (2.9a)
and (2.9b), respectively.

Let $I(f)$ denote the universal kriging estimate of $\int_D f(x)\,dx$
obtained by the modifications just mentioned, and let $f_N(x^*)$
denote the kriging estimate of $f(x^*)$. Recall our assumption that
$\int_D dx = 1$. Then we have the following

Theorem 2.2.

$$I(f) = \int_D f_N(x)\,dx. \tag{2.25}$$
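Proof. One may express (2.9a,b) in matrix form as

$$\Lambda(x^*) = A^{-1} c(x^*) \tag{2.26}$$

where $\Lambda(x^*) = (\lambda_1, \ldots, \lambda_N, -\mu_1, \ldots, -\mu_J)^T$,
$c_i(x^*) = \gamma(x_i - x^*)$, $1 \le i \le N$, $c_{N+j}(x^*) = \phi_j(x^*)$, $1 \le j \le J$,
and

$$A_{ij} = \gamma(x_i - x_j), \; 1 \le i, j \le N; \qquad
  A_{i,j+N} = A_{j+N,i} = \phi_j(x_i), \; 1 \le i \le N, \; 1 \le j \le J,$$

the remaining entries of $A$ being zero.
From (2.3), we see that if we define $\theta = (f(x_1), \ldots, f(x_N), 0, \ldots, 0)$, then

$$f_N(x^*) = \theta A^{-1} c(x^*). \tag{2.27}$$

Now it is clear from (2.12) that for the integration problem, the
universal kriging equation may be represented as

$$I(f) = \theta A^{-1} \int_D c(x)\,dx = \int_D \theta A^{-1} c(x)\,dx = \int_D f_N(x)\,dx \tag{2.28}$$

and our proposition is established.

The predicted mean square error was given in (2.13). But
the following evident result is useful:

Corollary:

$$E\Big[\Big(I(f) - \int_D f(x)\,dx\Big)^2\Big]
  \le \Big(\int_D \big(E[(f_N(x) - f(x))^2]\big)^{1/2}\,dx\Big)^2. \tag{2.29}$$

Proof. By the Cauchy-Schwarz inequality applied to the covariance,

$$E\Big[\Big(\int_D f_N(x)\,dx - \int_D f(x)\,dx\Big)^2\Big]
  = \int_D\!\int_D \operatorname{cov}\big(f(x) - f_N(x),\, f(x') - f_N(x')\big)\,dx\,dx'
  \le \Big(\int_D \big(\operatorname{var}(f_N(x) - f(x))\big)^{1/2}\,dx\Big)^2.$$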

From the corollary, it is apparent that earlier bounds with
respect to convergence of kriging point estimates can be directly
applied to bounding the convergence of integral estimates as the
number of sample pairs increases. Moreover the above analysis is
applicable to estimates of other linear functionals $L(f)$ of the
The final theoretical topic we shall broach, in connection with
kriging convergence, is the effect of incorrect variogram on the
estimate $f_N(x^*)$. For simplicity, assume the no drift case. It
is fairly clear that if the variogram is in error, there is little
hope of estimating $E[(f(x^*) - f_N(x^*))^2]$ correctly.

Example. In this example, we show that it is possible for the kriging
predictor to be exact, while the variogram (and hence the error
estimate) contains significant error. Suppose $\gamma_2 = b\gamma_1$, where $b$
is any positive constant. If $\lambda = (\lambda_1, \ldots, \lambda_N)$ is the minimizer of
(2.7), subject to the constraints (2.5), with $\gamma = \gamma_1$, then $\lambda$ will
also be the constrained minimizer of (2.7) with $\gamma = \gamma_2$. Thus if
a presumed variogram is so much as approximately proportional to
the correct one, the estimate $f_N(x^*)$ will be reliable. But from
(2.10), one sees that (ignoring the drift term) the error estimate
under $\gamma_2$ will differ from that under $\gamma_1$ by the scale factor $b$.
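The example can be checked numerically. The sketch below is an illustration with assumptions of our own: no drift, scalar locations, an exponential variogram standing in for $\gamma_1$, the factor $b = 3$, and hypothetical function names.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ordinary_krige(xs, x_star, gamma):
    """Return (weights, predicted mse) for no-drift kriging at x_star."""
    n = len(xs)
    a = [[gamma(xi - xk) for xk in xs] + [-1.0] for xi in xs]
    a.append([1.0] * n + [0.0])
    b = [gamma(xi - x_star) for xi in xs] + [1.0]
    sol = solve(a, b)
    lam, mu = sol[:n], sol[n]
    mse = sum(l * gamma(x_star - x) for l, x in zip(lam, xs)) - mu  # formula (2.10)
    return lam, mse


scale = 3.0  # the constant b of the example (assumed value)
gamma1 = lambda h: 1.0 - math.exp(-abs(h))
gamma2 = lambda h: scale * gamma1(h)

xs = [0.0, 1.0, 3.0]
lam1, mse1 = ordinary_krige(xs, 0.5, gamma1)
lam2, mse2 = ordinary_krige(xs, 0.5, gamma2)
print(lam1)
print(lam2)
print(mse2 / mse1)
```

The two weight vectors agree up to rounding, so the predictor is the same under either variogram, while the ratio of the predicted errors equals the scale factor. The Lagrange multiplier scales by the same factor, which is why the whole of (2.10) scales.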
Let $\lambda^{(1)}$ and $\lambda^{(2)}$ be the solutions of the universal kriging
equation (2.9a,b) under variograms $\gamma_1$ and $\gamma_2$, respectively.
Suppose that for some positive number $\delta$ and all $h$, $|\gamma_1(h) - \gamma_2(h)| < \delta$.
From a standard numerical analysis formula (e.g., Szidarovszky
and Yakowitz (1978), p. 214), we have that for $\delta < \|A\| / \Gamma(A)$,

$$\|\lambda^{(1)} - \lambda^{(2)}\| \le \Gamma(A)\, \|\lambda^{(1)}\|\, \delta \,/\, (\|A\| - \delta\, \Gamma(A)) \tag{2.30}$$

where $A$ is the matrix determined in connection with (2.26) and

$$\Gamma(A) = \|A\|\, \|A^{-1}\|$$

is the condition number. Some insight into the potential perniciousness
of variogram error can be inferred from (2.30) by considering
that the linear equation associated with least squares problems
frequently is ill-conditioned because of collinearity effects.
This phenomenon is evidenced by large condition number $\Gamma(A)$.
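The growth of the condition number when sample points nearly coincide can be seen in a small computation. The sketch below is an illustration only: the infinity norm, the no-drift form of the matrix, the variogram $\gamma(h) = |h|$, and all function names are assumptions made for the example.

```python
def inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == k else 0.0 for k in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [row[n:] for row in m]


def norm_inf(a):
    """Maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in a)


def condition_number(a):
    """Product of the norms of a and of its inverse."""
    return norm_inf(a) * norm_inf(inverse(a))


def kriging_matrix(xs, gamma):
    """Symmetric kriging matrix for the no-drift case (J = 1, phi_1 = 1)."""
    n = len(xs)
    a = [[gamma(xi - xk) for xk in xs] + [1.0] for xi in xs]
    a.append([1.0] * n + [0.0])
    return a


print(condition_number(kriging_matrix([0.0, 1.0], abs)))    # well separated samples
print(condition_number(kriging_matrix([0.0, 0.01], abs)))   # nearly coincident samples
```

Moving the second sample from distance 1 to distance 0.01 raises the condition number from about 3 to about 201, so by (2.30) the same variogram error $\delta$ can perturb the weights far more when samples cluster.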

Let  and  E2  be  expectations  of  square  differences  determined 
variograms 

by/Y^  and  y2  respectively  and  (2.8).  From  earlier  developments,  we 
can  be  assured  that  if  the  kriging  equation  (2.9)  uses  y = y2» 

E2((f(x*)  - fN(x*))2]  h.  o 

provided  only  that  {x^}  is  dense  in  X.  In  the  noiseless  case, 
and  determine  metrics  d^  and  d2  (2.3)  on  V = span  (f(x)  : xeX} 
according  to 

dVJ,2)  = (E3ly-  z]2)1/2,  3 = 1,2;  y , 2 e’/  . (2.31) 

Thus  v ls  the  smallest  space  containing  all  linear  predictors. 

The  task  of  finding  the  circumstances  relating  to  and  y2  » under 
wmch  d^  and  d2  determine  equivalent  topologies  remains  a subject 
for  future  research.  At  this  point,  one  can  quickly  confirm  that 
11  = taf(x)  : a real,  xtX,  is  dense  m /,  with  respect  to  both 
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metrics,  and  if  both  venograms  are  continuous  and  have  tne.r 
unicue  minima  at  tne  oriom,  then  ,2.31)  implies  convereence  with 


resoect  to  ^ 


For  under  these  assumptions,  for  any  unbiased 
linear  predictor  fNlx*)  in  /,  there  is  a domain  point  xN  and  a 
random  variable  f(xN)s^  such  that 


d. (f (xN) ,fN(x*) ) < 1/N,  D = 1*2, 


Note  that  m view  of  (2.31), 

d (f (xN) ,f (x*) ) = 2y.(xN-x*),  3=1,2. 


(2.32) 


(2.33) 


So  if  d.  (f  (x*)  ,f..(x*))-0,  then  xN-x*.  Now  (2.32)  and  (2.335 
I N 

imply  that  d2  ( f (:<*)  , fj.  (x*) ) -0  . 

Unfortunately, it is not always the case that convergence in
$d_2$ implies convergence with respect to $d_1$.

Example. Let $\gamma_2$ be a bounded variogram and $\gamma_1$ an unbounded
function (as in the Brownian motion case). Suppose that $g_N$
converges to $f(x^*)$ with respect to both metrics $d_1$ and $d_2$.
Define

$f_N(x^*) = (1 - \lambda_N)\, g_N + \lambda_N f(z_N),$  (2.34)

where $\{\lambda_N\}$ is a sequence of numbers converging to 0 and the
$z_N$ are points in X such that $\gamma_1(x^* - z_N) > 1/\lambda_N^2$. Then still
$d_2(f(x^*), f_N(x^*)) \to 0$, but $d_1(f(x^*), f_N(x^*))$ will be approximated by
$\lambda_N (2\gamma_1(x^* - z_N))^{1/2}$, which is bounded away from 0. An unsettling
aspect of this example is that one may very well anticipate that for
relatively large sample sizes, least squares estimates may well
assign some weight to samples $x_i$ with domain points far from $x^*$.
These points could cause havoc if $E[(f(x_i) - f(x^*))^2]$ is much
larger than anticipated. In kriging practice, it is common to assume
that variograms are bounded by "sills".
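The mechanism behind the example can be checked numerically. The sketch below is illustrative only and rests on assumptions that are not taken from the text: a Brownian-type variogram |h| for the unbounded case, an exponential variogram with unit sill for the bounded case, weights lambda_N = 1/N, the convention d(f(x), f(x+h)) = (2 gamma(h))^(1/2), and points z_N placed at distance N^2 from x*. It evaluates the size of the perturbation contributed by the term lambda_N f(z_N) in each metric.

```python
import math

def gamma_unbounded(h):
    """Brownian-motion-type variogram (assumed form): grows without bound."""
    return abs(h)

def gamma_bounded(h, sill=1.0, rate=25.0):
    """Exponential variogram (assumed form): bounded above by its sill."""
    return sill * (1.0 - math.exp(-rate * abs(h)))

def distance(gamma, h):
    """d(f(x), f(x + h)) = sqrt(2 * gamma(h)) under the assumed convention."""
    return math.sqrt(2.0 * gamma(h))

def perturbation(gamma, n):
    """Size of lambda_N * (f(x*) - f(z_N)) in the metric induced by gamma."""
    lam = 1.0 / n          # hypothetical lambda_N, converging to 0
    h = float(n * n)       # |x* - z_N|, so gamma_unbounded(h) = 1 / lam**2
    return lam * distance(gamma, h)

for n in (10, 100, 1000):
    print(n, perturbation(gamma_unbounded, n), perturbation(gamma_bounded, n))
```

Under these assumptions the perturbation stays near the square root of 2 in the metric of the unbounded variogram, while it shrinks like 1/N in the metric of the bounded one.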
3. Nonparametric Regression Application

The intention of the present section is to reveal extensions
of nonparametric regression which make this approach more suited to
Problems 1 and 2 of Section 1. In the section to follow, a
comparison of properties of kriging with those of nonparametric
regression will be offered.

The particular nonparametric regression (NPR) method to be
investigated here is the kernel estimator proposed by Watson (1964).
The two developments revealed here are (i) a formula for the
asymptotic expected square error, and (ii) a data-based approximation
of the mean squared error. The discussion closes by showing that
the asymptotic convergence of the NPR estimates is, in a certain
sense, optimal.


Let (X,Y) denote jointly distributed random variables. The
dimension m of X is arbitrary, but Y is real. Nonparametric
regression methods are intended for the problem of inferring
the conditional expectation (i.e., regression function)

$m(x) = E[Y \mid X = x]$  (3.1)

on the basis of an observation $\{(x_i, y_i)\}_{i=1}^N$ of the
sequence $\{(X_i, Y_i)\}$.

To begin with, let us phrase Problem 1 of Section 1 in NPR
terms. One presumes that the random vector (X,Y) satisfies

$Y = m(X) + n(X),$  (3.2)

where m is a fixed deterministic but unknown "regression function".
In NPR, the distribution of the noise n(x) may depend on the domain
point.

The Watson NPR estimator is

$m_N(x) = \Big[\sum_{i=1}^N y_i\, k((x_i - x)/a_N)\Big] / D_N(x)$  (3.3)

where $D_N(x) = \sum_{i=1}^N k((x_i - x)/a_N)$, $a_N$ is a positive number, and
$k(\cdot)$ is a probability density function chosen by the user. By way
of convergence results, it is known (Schuster and Yakowitz (1979))
that if $\{a_N\}$, k(x), and the (X,Y) variable satisfy certain lenient
conditions, then for any given $\epsilon > 0$, there is some constant C such that
for every N,

$P[\sup_x |m_N(x) - m(x)| > \epsilon] < C/(N a_N^2).$  (3.4)

It is often not practical to compute the constant C and, in any
event, the bound above is typically pessimistic.
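The estimator (3.3) is a kernel-weighted average of the observed responses, which the following minimal sketch makes concrete. The function names, the Gaussian product kernel, and the toy data are assumptions made here for illustration; the text leaves the choice of k to the user.

```python
import math

def gaussian_kernel(u):
    """Standard normal pdf in len(u) dimensions (an illustrative choice of k)."""
    m = len(u)
    return math.exp(-0.5 * sum(c * c for c in u)) / (2.0 * math.pi) ** (m / 2.0)

def watson_estimate(x, xs, ys, a_n, kernel=gaussian_kernel):
    """Kernel regression estimate m_N(x) of (3.3).

    x   : query point (sequence of m floats)
    xs  : sample domain points x_1..x_N (each a sequence of m floats)
    ys  : observations y_1..y_N
    a_n : positive bandwidth a_N
    """
    weights = [kernel([(xi_c - x_c) / a_n for xi_c, x_c in zip(xi, x)])
               for xi in xs]
    d_n = sum(weights)                       # D_N(x)
    if d_n == 0.0:
        raise ZeroDivisionError("all kernel weights underflowed; increase a_n")
    return sum(w * y for w, y in zip(weights, ys)) / d_n

# Noise-free samples of the hypothetical regression function m(x) = 2x on [0, 1].
xs = [(i / 20.0,) for i in range(21)]
ys = [2.0 * p[0] for p in xs]
estimate = watson_estimate((0.5,), xs, ys, a_n=0.05)
print(estimate)
```

Because the weights sum to one after division by D_N(x), the estimate always lies between the smallest and largest observed y values.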

Since in kriging squared error is the essence, our analysis at this
point is directed toward establishing the behavior of
$E[(m_N(x) - m(x))^2]$ as the number N of observations
increases. Toward that end, let h(x,y) and g(x) be the pdf's of
(X,Y) and X, respectively. Let $w(x) = \int y\, h(x,y)\, dy$, thus
$m(x) = w(x)/g(x)$, and define for some m-tuple $x^*$, and $1 \le i \le N$,

$U_{Ni} = k((x^* - X_i)/a_N)/a_N^m,$

$V_{Ni} = Y_i U_{Ni},$  (3.5)

$U_N = (1/N) \sum U_{Ni} : 1 \le i \le N,$

$V_N = (1/N) \sum V_{Ni} : 1 \le i \le N.$
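The quantities in (3.5) can be read as kernel estimates of g(x*) and w(x*), and their ratio reproduces the Watson estimate at x*. The sketch below illustrates this for the scalar case m = 1; the Gaussian kernel, the evenly spaced design, and the function names are assumptions chosen for the illustration.

```python
import math

def k(u):
    """Standard normal pdf, used here as the kernel (an illustrative choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def u_n(x_star, xs, a_n):
    """U_N of (3.5) with m = 1: a kernel estimate of the density g(x*)."""
    return sum(k((x_star - x) / a_n) / a_n for x in xs) / len(xs)

def v_n(x_star, xs, ys, a_n):
    """V_N of (3.5) with m = 1: a kernel estimate of w(x*) = m(x*) g(x*)."""
    return sum(y * k((x_star - x) / a_n) / a_n for x, y in zip(xs, ys)) / len(xs)

def m_n(x_star, xs, ys, a_n):
    """The ratio V_N / U_N, which equals the Watson estimate at x*."""
    return v_n(x_star, xs, ys, a_n) / u_n(x_star, xs, a_n)

xs = [i / 50.0 for i in range(51)]      # evenly spread points on [0, 1]
ys = [1.0 + x for x in xs]              # noise-free m(x) = 1 + x
density = u_n(0.5, xs, a_n=0.05)        # near 1, the density of a uniform design
value = m_n(0.5, xs, ys, a_n=0.05)      # near m(0.5) = 1.5
print(density, value)
```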

Throughout  this  section,  we  will  assume  that  the  kernel  pdf  k(u) 
is  selected  so  as  to  satisfy  the  properties  (i)  to  (iv)  below: 

(i) $k(u)$ and $\|u\,k(u)\|$ are bounded,

(ii) $\int u\, k(u)\, du = 0$,

(iii) $\int \|u\|^2 k(u)\, du < \infty$,

(iv) the functions g(x) and w(x) are twice continuously
differentiable and the second partial derivatives
of g(x) are bounded,

(v) the second moment of Y is finite.

The pdf of the multivariate normal law satisfies properties (i) to (iii).

The convergence facts we will need are given in the statement
below.

Theorem 1. Let m be the dimension of the sample vectors $x_1, x_2, \ldots$
and assume $g(x^*) > 0$. Then

(a) $\mathrm{var}(V_N)$ and $\mathrm{var}(U_N)$ are both $O(1/(N a_N^m))$,

(b) $(E[V_N] - w(x^*))$ and $(E[U_N] - g(x^*))$ are both $O(a_N^2)$.

(c) If $a_N = N^{-1/(m+4)}$, then for some sequence of events $E_N$
such that $P[E_N] \to 1$,

$E[(m_N(x^*) - m(x^*))^2 \mid E_N] = O(N^{-1/(1 + m/4)}).$  (3.6)


Proof. This theorem is very much inspired by developments of
Schuster (1972). Thus part (a) is essentially formula (1) in the
proof of his Lemma 1, but extended here to m variables. In
particular, after a change of variables to $u = (x_i - x^*)/a_N$ we
have

$E[(U_{Ni})^2] = a_N^{-m} \int k(-u)^2\, g(x^* - a_N u)\, du = (g(x^*)/a_N^m)\Big[\int k^2(u)\, du + O(a_N)\Big].$

Similarly, one may confirm that

$E(U_{Ni}) = g(x^*)\Big[\int k(u)\, du + O(a_N)\Big].$

Now use that the variables $U_{Ni}$ are uncorrelated to get
$\mathrm{var}(U_N) = O(1/(a_N^m N))$. The demonstration for $\mathrm{var}(V_N)$
proceeds in the same fashion. The proof of part (b) is essentially
that of the first part of Lemma 2 in Schuster (1972). Thus after the
change in variable, and use of assumed property (ii) above,

$E(U_{Ni}) - g(x^*) = \int k(-u)[g(x^* - a_N u) - g(x^*)]\, du \le (a_N^2/2) \sup_x \|g''(x)\| \int \|u\|^2 k(u)\, du = O(a_N^2).$

Clearly, $U_N$ and $U_{Ni}$ have the same expectation. The analysis of
$E(V_N)$ proceeds in a similar fashion.

Toward demonstration of (c), define $E_N$ to be the event that
$U_N > (1/2)\, g(x^*)$ and $V_N < 2 w(x^*)$.
In view of parts (a) and (b) and Chebyshev's inequality, the probability
of $E_N$ converges to 1. Also, note that
$\mathrm{var}(U_N \mid E_N) \le \mathrm{var}(U_N)/P[E_N]$ and
$\mathrm{var}(V_N \mid E_N) \le \mathrm{var}(V_N)/P[E_N]$.
Now under $E_N$,

$|m_N(x^*) - m(x^*)| = |V_N/U_N - w(x^*)/g(x^*)| = |V_N g(x^*) - U_N w(x^*)| / (U_N g(x^*))$

$\le (2/g(x^*)^2)\,\big(|w(x^*)|\,|U_N - g(x^*)| + g(x^*)\,|V_N - w(x^*)|\big).$

Part (c) now is easily seen to be a consequence of (a) and (b).


Our attention now turns to derivation of a data-based
estimate of the mean squared error of the NPR point estimate $m_N(x)$:

$\sigma^2(x) = E[(m_N(x) - m(x))^2 \mid X_j = x_j,\ 1 \le j \le N].$  (3.7)

Observe that since the terms $\{V_{Ni}\}_i$ in (3.5) are uncorrelated,

$\sigma^2(x) \approx (1/D_N(x)^2) \sum_{i=1}^N E[n(x_i)^2]\, k^2((x - x_i)/a_N).$  (3.8)

The only term in (3.8) which is not known to the statistician is
$E[n(x_i)^2]$. But this can be approximated from the sample by defining
$\alpha$ to be any positive number less than 1 and defining

$\hat E[n(x_i)^2] = (1/N^\alpha) \sum_{j \in S(i,N)} (y_j - m_N(x_j))^2,$  (3.9)

where S(i,N) is the set of indices of the $N^\alpha$ nearest neighbors in
$\{x_k\}_{k=1}^N$ to $x_i$. Since in view of (3.4), $m_N(x)$ is converging, in N,
to $m(x) = E[Y \mid X = x]$, uniformly in x, and since with probability 1, the
radii of the sets S(i,N) become vanishingly small as $N \to \infty$, it is
evident that the estimate

$\hat\sigma^2(x) = (1/D_N(x)^2) \sum_{i=1}^N \hat E[n(x_i)^2]\, k^2((x - x_i)/a_N)$  (3.10)

satisfies the relation

$\hat\sigma^2(x)/\sigma^2(x) \to 1$ as $N \to \infty$, i.p.  (3.11)

Note that the estimator $\hat\sigma^2(x)$ depends solely on the statistician's
choices of $k(\cdot)$ and $\{a_N\}$, and the observed sequence $\{(x_i, y_i)\}$.
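A minimal sketch of the data-based error estimate may help fix ideas. It follows the recipe of (3.9) and (3.10) for scalar x; the Gaussian kernel, the rounding of N^alpha to an integer neighbor count, the inclusion of x_i among its own neighbors, and the simulated data are all assumptions introduced here.

```python
import math
import random

def kernel(u):
    """Standard normal pdf (an illustrative choice of k)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def m_hat(x, xs, ys, a_n):
    """Watson estimate at scalar x."""
    w = [kernel((xi - x) / a_n) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

def error_variance(x, xs, ys, a_n, alpha=0.5):
    """Data-based estimate of the error variance at x, following (3.9)-(3.10)."""
    n = len(xs)
    k_nn = max(1, round(n ** alpha))             # about N**alpha neighbors
    resid2 = [(ys[j] - m_hat(xs[j], xs, ys, a_n)) ** 2 for j in range(n)]
    noise = []
    for i in range(n):
        # indices of the k_nn sample points closest to x_i (x_i itself included)
        nbrs = sorted(range(n), key=lambda j: abs(xs[j] - xs[i]))[:k_nn]
        noise.append(sum(resid2[j] for j in nbrs) / k_nn)   # local noise level
    w = [kernel((x - xi) / a_n) for xi in xs]
    d_n = sum(w)
    return sum(v * wi * wi for v, wi in zip(noise, w)) / (d_n * d_n)

rng = random.Random(0)
xs = [rng.random() for _ in range(100)]
ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 0.5) for x in xs]
sigma2_hat = error_variance(0.5, xs, ys, a_n=0.1)
print(sigma2_hat)
```

The square root of the returned value can be read as an approximate standard error for the point estimate at x, in the asymptotic sense of (3.11).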

From the Theorem, one may conclude that if $a_N$ tends to zero slightly
faster than $(1/N)^{1/(m+4)}$, then the variance error part (a) will
dominate, yet need not seriously degrade the rate of convergence in
(3.6). Under this circumstance, $\hat\sigma^2(x)$ will be an asymptotically
accurate estimate of the expected square error.

One can confirm that for any $\delta > 0$, as $a_N \to 0$, the contribution
in (3.10) of terms $x_i$ such that $\|x - x_i\| > \delta$ becomes negligible,
and in practice, we have found that

$\hat\sigma^2(x) = \hat E[n(x)^2],$  (3.12)

gives a reliable approximation of the error variance. Similarly,
one can show that for any points $x^1, x^2$,

$\widehat{\mathrm{Cov}}(x^1, x^2) = \Big[\frac{1}{D_N(x^1) D_N(x^2)} \sum_{i=1}^N k((x^1 - x_i)/a_N)\, k((x^2 - x_i)/a_N)\Big] \big[\hat E(n(x^1)^2)\, \hat E(n(x^2)^2)\big]^{1/2},$  (3.13)

is asymptotically accurate.

This relationship is useful in applying the NPR approach to
Problem 2 of Section 1, as is now seen.

Our procedure for approaching Problem 2 is to apply numerical quadrature to
the function $m_N(x)$. Specifically, let $\{(t_j, w_j)\}_{j=1}^M$ be quadrature
points and weights for integrating over the desired domain D. The
nonparametric estimate $I_N$ is then defined by

$I_N = \sum_{j=1}^M w_j\, m_N(t_j),$

which is an estimate of $I = \sum_{j=1}^M w_j\, m(t_j)$.

The error

$I_N - \int_D m(x)\, dx$  (3.14)

has two components,

$I_N - \int_D m(x)\, dx = [I_N - I] + \Big[I - \int_D m(x)\, dx\Big],$  (3.15)

the first bracketed term being the error due to approximation of
the function m(x) by $m_N(x)$ in the quadrature formula, and the second
arising from quadrature truncation error in approximating the
integral. Methods for bounding the latter source are found in the
numerical analysis literature (e.g., Szidarovszky and Yakowitz
(1978), Chapter 3). For example, if D is an m-dimensional unit
cube, and one applies a product trapezoidal quadrature rule (keeping
in mind that the $t_j$'s in (3.14) can be chosen arbitrarily), one can
verify that, provided m(x) has continuous second derivatives,
$|I - \int_D m(x)\, dx| = O(h^2)$, h being the step size for the quadrature
formula.

The variance of the first source is given by
$\sum_i \sum_j w_i w_j\, \mathrm{Cov}(t_i, t_j)$, which can be approximated by

$\hat E[(I_N - I)^2] = \sum_i \sum_j w_i w_j\, \widehat{\mathrm{Cov}}(t_i, t_j) : 1 \le i, j \le M,$

where the term $\widehat{\mathrm{Cov}}$ is the covariance approximation given in (3.13).
As we have noted, as $a_N \to 0$, the covariance terms become negligible
and a useful approximation is that, in terms of (3.10),

$E[(I_N - I)^2] \approx \sum_{i=1}^M w_i^2\, \hat\sigma^2(t_i).$  (3.16)
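The quadrature step can be sketched in a few lines for the one-dimensional domain D = [0, 1]. The composite trapezoidal rule and the function names are choices made here for illustration; any callable returning the regression estimate and any callable returning the pointwise error variance can be passed in.

```python
def trapezoid_rule(m_points):
    """Points t_j and weights w_j of the composite trapezoidal rule on [0, 1]."""
    h = 1.0 / (m_points - 1)
    ts = [j * h for j in range(m_points)]
    ws = [h] * m_points
    ws[0] = ws[-1] = h / 2.0
    return ts, ws

def integral_estimate(m_n, ts, ws):
    """I_N: the weighted sum of m_N over the quadrature points."""
    return sum(w * m_n(t) for t, w in zip(ts, ws))

def integral_error_variance(sigma2_hat, ts, ws):
    """Right-hand side of (3.16): sum of w_j**2 * sigma_hat**2(t_j)."""
    return sum(w * w * sigma2_hat(t) for t, w in zip(ts, ws))

ts, ws = trapezoid_rule(11)                                 # step size h = 0.1
linear = integral_estimate(lambda t: 2.0 * t, ts, ws)       # exact for linear m
quadratic = integral_estimate(lambda t: t * t, ts, ws)      # truncation error h**2 / 6
unit_noise = integral_error_variance(lambda t: 1.0, ts, ws)
print(linear, quadratic, unit_noise)
```

The quadratic case shows the truncation component of the error: the rule returns 1/3 + h^2/6 rather than 1/3, consistent with a second-order rule.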


The final consideration of this section concerns a certain
optimality property of NPR convergence. In view of (3.6) and the
Chebyshev inequality, one can conclude that for $r^* = 2/(m+4)$, and
for any regression function m(x) and noise process n(x) satisfying
the theorem hypothesis, if $a_N \to 0$ proportionally to $N^{-1/(m+4)}$, then
for $r = r^*$,

$\lim_{C \to \infty} \limsup_N P[|m_N(x^*) - m(x^*)| > C N^{-r}] = 0.$  (3.17)

Thus, in the terminology of Stone (1980), the NPR estimate achieves
convergence rate $r^*$. But according to the Theorem of that work,
for any NPR estimator of a twice continuously differentiable
regression function of m independent variables, $r^* = 2/(m+4)$ is
the optimal rate: there is no estimator for which (3.17) holds
for some $r > r^*$.
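The arithmetic behind the rate can be made explicit. The sketch below, with function names chosen here for illustration, computes the bandwidth of part (c) of Theorem 1 and checks that it balances the variance order 1/(N a_N^m) of part (a) against the squared bias order a_N^4 implied by part (b).

```python
def optimal_bandwidth(n, m):
    """a_N = N**(-1/(m+4)), the bandwidth choice in part (c) of Theorem 1."""
    return n ** (-1.0 / (m + 4))

def mse_rate_exponent(m):
    """Exponent in (3.6): the squared error is O(N**(-4/(m+4)))."""
    return 4.0 / (m + 4)

def stone_rate(m):
    """r* = 2/(m+4), the convergence rate appearing in (3.17)."""
    return 2.0 / (m + 4)

n, m = 1000, 2
a = optimal_bandwidth(n, m)
variance_order = 1.0 / (n * a ** m)     # order of the variance term, part (a)
bias_sq_order = a ** 4                  # square of the O(a_N**2) bias, part (b)
print(a, variance_order, bias_sq_order, stone_rate(m))
```

The rate deteriorates as the dimension m grows: r* is 0.4 for m = 1 but only 0.25 for m = 4.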

4. A Comparison of Convergence Properties of Kriging and
Nonparametric Regression

Assume that the intrinsic random function (IRF) hypothesis
holds, and there is no drift ($J = 1$, $\phi_1 = 1$). As mentioned at the close
of Section 2, the nonparametric regression (NPR) approach is
applicable if the sample domain points $\{x_i\}_{i=1}^N$ are chosen randomly,
and if, with probability 1, the sample functions f(x) of the
IRF are continuous at $x^*$, then $m_N(x)$ converges to f(x) in the mean.

If the sample IRF's are twice-continuously differentiable, with
probability 1, then Theorem 3.1 gives convergence rates.

Toward addressing parts b of Problems 1 and 2 of Section 1,
we have provided error formulas (3.10) and (3.16) which are
asymptotically accurate provided only that the sample functions are
continuous at $x^*$. These convergence properties hold regardless of
whether noise n(x) in (1.1) is present. But all these statements
have been predicated on the assumption that the values $\{x_i\}$ are
actually a random sample. However, under fairly lenient assumptions,
Schuster and Yakowitz (1980, Theorem 2) have shown in the univariate
case that $m_N(x)$ converges uniformly in x to f(x) provided only that
the $x_i$'s are dense. Undoubtedly such results can be extended to
bear on kriging-type problems more forcefully, and citations of
related results (especially concerning the Priestley-Chao estimator)
are to be found in the above reference.


Now it is clear that if the variogram is known exactly,
because the kriging estimator is the best unbiased linear
estimator, then the expected square error of the kriging estimator
$f_N(x)$ is no greater than that of the NPR estimator, which is also
linear and unbiased. On the other hand, in the noisy case, it is
not known at this point whether its asymptotic convergence rate is
faster than the NPR rate given in Theorem 3.1. In summary, when
the IRF hypothesis is true and the variogram is known to the
statistician, the kriging estimate is better in the least squares
sense than the NPR estimate, and its error estimators (2.10) and
(2.13) are exact, whereas the NPR error estimators are only
asymptotically accurate.

On the other hand, if the IRF hypothesis cannot be relied on,
or even when it can, if the variogram is not known (even if it is
known to be in one of the parametric families of Table 2.1), then
nothing can be said about the convergence of either the kriging
estimator or the error function, whereas NPR convergence conditions
we have alluded to may well be satisfied.

5. Some Illustrative Computations

We hope to eventually publish results summarizing our extensive
computational experimentation on kriging and alternative procedures.
For now, we provide a brief illustration of the preceding material
by reporting just a few computations. In this particular case study,
the function f(x) is chosen to exactly satisfy the kriging hypotheses:
It is a realization of the Gaussian process with mean 0 and variogram

$\gamma(h) = C(1 - \exp(-25h)).$  (5.1)

We have plotted f(x) in Figure 1. The sample function f(x) was
simulated according to an algorithm described in Newman and Odell
(1971) and is exact (within machine error) to the extent of one's
being able to provide independent Gaussian observations. These
we approximated by the Box-Muller algorithm (described in Yakowitz
(1977)) using the CDC random number generator RANF.
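For readers who wish to reproduce a comparable test function, the following sketch draws a zero-mean Gaussian process on a finite grid. It is not the Newman and Odell algorithm cited above: it substitutes a plain Cholesky factorization of the covariance matrix, and it assumes that the variogram (5.1) corresponds to the stationary covariance C exp(-25|h|), which holds under the convention that the variogram is half the expected squared increment. Python's own uniform generator stands in for RANF.

```python
import math
import random

def box_muller(rng):
    """One pair of independent N(0, 1) draws from two uniform draws."""
    u1 = 1.0 - rng.random()              # in (0, 1], avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def cholesky(a):
    """Lower-triangular L with L * L^T = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def simulate(xs, c=1.0, rate=25.0, seed=0):
    """Sample f at the distinct points xs with covariance c * exp(-rate * |h|)."""
    rng = random.Random(seed)
    cov = [[c * math.exp(-rate * abs(s - t)) for t in xs] for s in xs]
    low = cholesky(cov)
    z = []
    while len(z) < len(xs):
        z.extend(box_muller(rng))        # Gaussian draws arrive in pairs
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(xs))]

grid = [i / 50.0 for i in range(51)]
sample = simulate(grid, seed=1)
print(sample[:5])
```

The Cholesky route costs on the order of N^3 operations, which is acceptable for the few dozen points used in a study of this size.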

In Table 2.2, we report the results of applying the kriging
method with exponential variogram to 50 uniformly chosen domain
points from the domain X = [0,1]. (RANF was used to obtain these
points also.) In the first listing, we give the approximation at
eight equidistant domain points, of the kriging algorithm in which
the exponential parameter has been set to its correct value. This
is, therefore, kriging under the ideal conditions of the variogram
being known. In the second exponential listing, the variogram
parameters a and w were obtained by least-squares fit according to
current practice. In Table 2.3, we have repeated the calculations,
but Gaussian noise $n(x_i)$, with standard deviation $\sigma = 0.5$, was added
to each value $f(x_i)$. In Table 2.3, we have repeated the calculations,
using exactly the same $(x_i, y_i)$ sample values as in the construction
for Table 2.2, but here we have assumed that the variogram is
spherical (the parameters again being calibrated by a least squares
procedure). Also, we have applied the same simulated data to the
Watson nonparametric regression method.
One will notice that in all cases, the estimation capabilities
exhibited by the various rules are quite comparable. Interestingly
enough, the spherical variogram is also competitive, even though
the model is wrong. But, especially in the noisy case, the
spherical rule is much less accurate than the other rules in
approximating the errors.

Delhomme (1979) has claimed that classical function interpolation
and approximation methods are not effective with intrinsic
random functions. Our experience with cubic splines, Lagrange
interpolation, and least squares approximation concurs with this
assessment. In Table 2.4, we present the estimates obtained from
using the IMSL cubic spline package on the data points used for
calculations in the preceding tables.

We applied the kriging integration algorithm to the function
f(x) which has served as basis for the calculations reported in the
preceding tables. The same data pairs $\{(x_i, y_i)\}$ were used. The
results of the integration estimation studies are summarized in
Table 2.5 below.

DOMAIN POINT   TRUE VALUE   PREDICTED VALUE   EXPECTED ERROR
X*             f(X*)        f_50(X*)          E[(f(X*) - f_50(X*))^2]^(1/2)

.111111        -.273470     -.422838          .191929
.222222        -.009991      .032374          .6221??
.333333         .203601      .353256          .1299??
.444444         .175779      .176909          .014626
.555556        -.621823     -.379635          .29812?
.666667         .010971     -.069117          .2241??
.777778        -.141964     -.096466          .12023?
.888889        -.165034     -.165653          .00701?

$\sigma$ = 0, Exponential Covariogram, a = 1/25

.111111        -.273470     -.413973          .233940
.222222        -.009991      .029094          .710130
.333333         .203601      .34?390          .159101
.444444         .175779      .176730          .010131
.555556        -.621823     -.372394          .363909
.666667         .010971     -.067460          .273565
.777778        -.141964     -.096163          .140534
.888889        -.165034     -.145523          .009691

$\sigma$ = 0, Exponential Covariogram, Calibrated a

.111111        -.273470     -.492881          .191929
.222222        -.009991     -.119208          .622137
.333333         .203601     (illegible)       .??9905
.444444         .175779     (illegible)       .011628
.555556        -.621823     (illegible)       .298123
.666667         .010971      .311225          .224106
.777778        -.141964     (illegible)       .120230
.888889        -.165034      .195491          .007817

$\sigma$ = 0.5, Exponential Covariogram, a = 1/25

.111111        -.273470     -.309462          .638156
.222222        -.009991      .0973??         1.01402?
.333333         .203601      .6296??          .478127
.444444         .175779      .092593          .066360
.555556        -.621823     -.200567          .396092
.666667         .010971      .216333          .722449
.777778        -.141964      .4403?0          .409349
.888889        -.165034      .197299          .035730

$\sigma$ = 0.5, Exponential Covariogram, Calibrated a

TABLE 2.2

RANDOM FUNCTION ESTIMATES, I
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DOMAIN 

POINT 

X* 


TRUE  VALUE 
f (X*) 


PREDICTED 

VALUE 

f50(X*> 


EXPECTED 

ERROR 

E[(f  (X*)-f5^X*))^]i/2 


-rmi— 

.333333 

, .****** 

•666667 

.777778 

.888889 


-stfHW' 

.2:03601 

.175779 

__=jj6.21.823. 
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.0*9139 
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-.120925 


.-Q1Z9  24  „ 

.01*013 
. 00  511 
. 007597 

..00.75  65 

.027576 

.006*61 

.027717 


a = 0,  Spherical  Variogram 


.111111 
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.333333 
_»*-**.  *4A. 
.555556 
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-.273*70 

-.009991 

.203601 

.J.75779 

-.62  182  3 
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-.16503*  .. 
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.02838  1 
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.0722.91. 
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. 100016 
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..-.10073  2 


.126*8  5 
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. 1 *U29  3 
• 09 13  9t 
.11071* 
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.031628 
.177*56 


o=0,  Watson  Algorithm 
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* §9660 i .010971  . 3 5*6  ’0 
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' = 0.5,  Spherical  Variogram 


.3*7720 

.161325 

..091229 

.03516* 

.096362 
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..079503 

.026907 


1 n i 

- - ? ? 3 '»  7 0 
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'91 

— 766  7 1 
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■ ' i 3 

.70  -'Gl 
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7 7 7 7.3 
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. 173377 

.197700 

o • n r "» 

-.16503* 

— .1216*8 

.2^09s9 

0.5,  Watson 

Algorithm 

TABLE  2.3 

RANDOM  FUNCTION  ESTIMATES,  II 
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DOMAIN 

POINT  TRUE  VALUE  SPLINE  VALUE 

X*  f(X*) 


- uu 
'III22* 

•333333 

• 4 A 4 44  4 

- *555556 — 
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« 8P"  P89 


-.273470 
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• 175779 
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a = 0.5 


TABLE  2 . 4 
SPLINE  ESTIMATES 


) 
Exact Value of $\int f(x)\, dx$ = -0.089

                          No Noise                             Noise, $\sigma$ = 0.5
                          Approximation   Estimated Standard   Approximation   Estimated Standard
                                          Deviation of Error                   Deviation of Error

Exponential Variogram,    -0.051          0.93                 0.126           0.927
with a = 1/25

Exponential Variogram,    -0.053          0.94                 0.088           0.94
fitted parameter

Spherical Variogram       -0.043          0.13                 0.079           0.34

Watson Method             -0.056          0.19                 0.069           0.21

TABLE 2.5

Integration Estimation Experiments
ABSTRACT

Laplacian smoothing splines, smoothing splines on the sphere and
smoothing pseudo splines on the sphere are presented. The method of generalized
cross validation to choose the smoothing parameter is described. An
application of these methods to estimate divergence and vorticity of the atmosphere
from wind speed and wind direction is provided.

1. Introduction

This report portrays the status of current research into a meteorological
application which involves the use of multidimensional smoothing splines.
Aspects of meteorology, theoretical and applied mathematics, statistics,
numerical analysis and computer science are involved in the analysis. A more
detailed dissertation involving this problem is provided in Wahba and
Wendelberger (1980), Wendelberger (1981) and Wendelberger (1982). The
relevance to problems encountered in remote sensing is mentioned in a very
general way throughout. Research topics involving the application of
multidimensional smoothing splines are provided.

To review the work being done in this area there are sections about
Laplacian smoothing splines, smoothing pseudo splines on the sphere and the
method of generalized cross validation. These three sections are followed by
one which involves the analysis of meteorological data which is of the type
which may be encountered in the application of remote sensing. The last
section proposes some future research areas.

2.  The  Laplacian  Smoothing  Spline 

A Laplacian  smoothing  spline  (LSS)  is  a function  defined  from  Euclidean 
d-space,  Rd,  to  R which  arises  as  the  estimate  from  the  statistical  model  pre- 
sented below.  The  term  model  is  meant  in  the  broad  sense  of  Box  (1981).  In 
that  sense  we  tentatively  entertain  the  assumptions,  provide  a solution  using 
the  data  and  then  check  the  validity  of  the  assumptions.  In  this  section  we 
present  the  assumptions  which  this  model  entertains  and  provide  a solution 


using  the  data.  The  question  of  model  validity  will  not  be  dealt  with  here. 

In the model, the data $z_i \in R$, $i = 1, \ldots, N$, consist of a fixed
component and a random component. The fixed (or signal) component, $L_i f$, in
its most general form, is a continuous linear functional $L_i f$,
$i = 1, \ldots, N$, of a function $f \in X$, X the appropriate Sobolev space,
Adams (1975), to $R$. The random (or noise) component $\epsilon_i \in R$
satisfies

    $E\,\epsilon_i = 0, \quad i = 1, \ldots, N,$    2.1

    $E\,\epsilon_i^2 = \mathrm{Var}\,\epsilon_i = \sigma^2 \sigma_i^2, \quad i = 1, \ldots, N,$    2.2

for $\sigma^2$, $\sigma_i^2$ constants with $\sigma^2$ unknown, $\sigma_i^2$
known, and the

    $\epsilon_i, \ i = 1, \ldots, N,$ are independent.    2.3

In 2.1 and 2.2 E means mathematical expectation with respect to the error
distribution of $\epsilon_i$. In 2.2 the $\sigma_i^2$ are known weights which
should be thought of as relative measurement error variances. The fixed and
random components are additive:

    $z_i = L_i f + \epsilon_i, \quad i = 1, \ldots, N.$    2.4

We concern ourselves here with the evaluation functionals, $L_i f = f(t_i)$,
where $t_i \in R^d$, $i = 1, \ldots, N$, and the $t_i$ are considered to be
known without error. Then 2.4 becomes

    $z_i = f(t_i) + \epsilon_i, \quad i = 1, \ldots, N.$    2.5

Applications of remote sensing may involve continuous linear functionals other
than the evaluation functionals. For a further discussion of the use of
general continuous linear functionals see Wahba and Wendelberger (1980).

To recover an estimate of $f \in X$, say $g$, from the observations
$z = (z_1, \ldots, z_N)^T$ we require that $f$ be smooth. By smooth it is
meant that $J_m(f)$ is small where

    $J_m(f) = \sum_{\nu=1}^{M'} \frac{m!}{\alpha_{1,\nu}! \cdots \alpha_{d,\nu}!} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left[ \frac{\partial^m f(t)}{\partial t_1^{\alpha_{1,\nu}} \cdots \partial t_d^{\alpha_{d,\nu}}} \right]^2 dt,$    2.6

for $M' = \binom{m+d-1}{d-1}$, $t = (t_1, \ldots, t_d)$, and the
$\alpha_{1,\nu}, \ldots, \alpha_{d,\nu}$ are the $M'$ unique combinations of
$\{0, 1, \ldots, m\}$ such that $\alpha_{1,\nu} + \cdots + \alpha_{d,\nu} = m$.
The smoothness function is induced by the Sobolev space X of which $f$ is a
member.

Besides being smooth, $f$ should also be close to the data. To measure
closeness define

    $C(f) = \sum_{i=1}^{N} \left( [f(t_i) - z_i] / \sigma_i \right)^2.$    2.7

As defined here closeness and smoothness are conflicting criteria. To measure
the tradeoff between the two we introduce the parameter $\lambda$. The choice
of $\lambda$ will be discussed in section 4. The estimate $g$ of $f$ is chosen
as the minimizer of

    $C(f) + \lambda J_m(f).$    2.8


The minimizer of 2.8 can be shown to be of the form

    $g(t) = \sum_{i=1}^{N} c_i \eta_{J_m}(t, t_i) + \sum_{\nu=1}^{M} d_\nu \phi_\nu(t)$    2.9

where

    $\phi_\nu(t)$ = the M polynomials of total degree less than m which span $P_d^{m-1}$,    2.10

    $M = \binom{d+m-1}{d},$    2.11

    $P_d^{m-1}$ = the space of all polynomials of total degree less than m,

    $\eta_{J_m}$ = a function of $|t - t_i|$ which depends on $J_m$ and is
    rigorously defined in Wahba and Wendelberger (1980),

and $c = (c_1, \ldots, c_N)^T$, $d = (d_1, \ldots, d_M)^T$ are constants which
arise as the solution to the linear system

    $(K + N \lambda D_\sigma^2) c + T d = z$    2.12

and

    $T^T c = 0$    2.13

where $D_\sigma^2 = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2)$ and the N by
N matrix K and N by M matrix T depend only on $t_i$, $i = 1, \ldots, N$, and
$J_m(\cdot)$.

The estimate $g$ along with the assumptions stated in this section and
those made in section 4 involving the choice of $\lambda$ constitute the
Laplacian smoothing spline model.

The Laplacian smoothing spline is given by $g$ in 2.9. In remote sensing
applications which involve a small section of a sphere (the earth), the
Laplacian smoothing spline is appropriate. However, for applications which
involve a large area of a sphere we require splines which have the surface of
the sphere, rather than $R^d$, as their domain. These splines are developed in
the next section.
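
Conditions 2.12 and 2.13 can be stacked into one block system, which is the form in which they would typically be passed to a linear solver. The display below only restates those two equations; the lower right block is an M by M matrix of zeros and the lower entry on the right-hand side is an M-vector of zeros.

```latex
\begin{pmatrix} K + N\lambda D_\sigma^2 & T \\ T^T & 0 \end{pmatrix}
\begin{pmatrix} c \\ d \end{pmatrix}
=
\begin{pmatrix} z \\ 0 \end{pmatrix}
```

The coefficient matrix is square of order N+M, which is the system size referred to in section 4.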


3.  Smoothing  Splines  on  the  Sphere 


Smoothing  splines  on  the  sphere,  as  investigated  by  Wahba  (1981),  are 
developed  both  as  an  extension  of  one  dimensional  periodic  polynomial  splines 
and  as  a restriction  of  three  dimensional  thin  plate  (Laplacian)  smoothing 
splines  to  the  surface  of  the  sphere.  The  derivation  of  smoothing  splines  on 
the  sphere  parallels  that  of  Laplacian  smoothing  splines.  In  this  section  we 
provide  the  modifications  of  section  2 which  are  required  to  obtain  smoothing 
splines  on  the  sphere. 

The first modification is that the independent variable space, $R^d$, is
replaced by the surface of the sphere, S. This means that the $t_i$ in 2.5
become $t_i \in S$, $i = 1, \ldots, N$. In particular
$t_i = (\phi_i, \lambda_i)$, $\phi_i$ = latitude and $\lambda_i$ = longitude,
$i = 1, \ldots, N$.

The second modification is that in 2.11 M = 1. This means in 2.9 and
2.10 there is only one polynomial, $\phi_1(t) = 1$. Intuitively, this arises
because of the necessity of having a periodic solution.

The third modification is that $J_m(\cdot)$ in 2.6 is replaced by
$K_m(\cdot)$. $K_m(\cdot)$ is a restriction of $J_m(\cdot)$ to S. For the
specific form of $K_m(\cdot)$ see Wahba (1981).


The fourth modification is that $\eta_{J_m}(t, t_i)$, both in 2.9 and in the
definition of K in 2.12, is replaced by

    $\eta_{K_m}(t, t_i) = \sum_{\nu=1}^{\infty} \frac{2\nu+1}{\nu^m (\nu+1)^m} P_\nu(\cos(\Delta(t, t_i))),$    3.1

where $\cos(\Delta(t, t_i))$ equals the cosine of the angle between $t$ and
$t_i$. Also, $P_\nu(\cdot)$ is the $\nu$-th Legendre polynomial.

Modifications one through four provide the smoothing spline on the sphere
which is given analogously to 2.9 as

    $h(t) = \sum_{i=1}^{N} c_i \eta_{K_m}(t, t_i) + d_1, \quad m = 2, 3, \ldots$    3.2

To obtain and evaluate the smoothing spline on the sphere, 3.1 must be
evaluated. Wahba (1981) notes that the series given in 3.1 cannot be
expressed in terms of elementary functions. To compute smoothing splines on
the sphere, the accurate and fast evaluation of 3.1 is necessary. To alleviate
the difficulties which this entails, she derives smoothing pseudo splines on
the sphere.
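
To make the cost of 3.1 concrete, the following is a minimal sketch of evaluating a partial sum of the series with NumPy. The function name `eta_km_truncated` and the truncation point `n_terms` are assumptions made for illustration; nothing here is taken from Wahba (1981), and no claim is made about how many terms a given accuracy requires.

```python
import numpy as np
from numpy.polynomial import legendre


def eta_km_truncated(z, m=2, n_terms=200):
    """Partial sum of series 3.1 at z = cos(angle between t and t_i).

    n_terms is a hypothetical truncation point, not a value from the text.
    """
    nu = np.arange(1, n_terms + 1, dtype=float)
    # Coefficient of P_nu in 3.1; the series starts at nu = 1, so P_0 gets 0.
    coef = np.zeros(n_terms + 1)
    coef[1:] = (2.0 * nu + 1.0) / (nu**m * (nu + 1.0) ** m)
    # legval evaluates sum_k coef[k] * P_k(z).
    return legendre.legval(z, coef)
```

For m = 2 and z = 1 the terms telescope, since (2v+1)/(v^2 (v+1)^2) = 1/v^2 - 1/(v+1)^2 and every Legendre polynomial equals 1 at z = 1, so the partial sum there is 1 - 1/(n_terms+1)^2. This gives a convenient check on the coefficients.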

To obtain these splines $K_m(\cdot)$ is replaced by a topologically
equivalent norm $L_m(\cdot)$. In both 3.1 and 3.2 $K_m(\cdot)$ is replaced by
$L_m(\cdot)$, with specific expressions for $\eta_{L_m}(t, t_i)$ given in
Wahba (1981). For illustrative purposes we provide for m = 2

    $\eta_{L_2}(t, t_i) = \ln\left(1 + (2/(1-z))^{1/2}\right) [3z^2 - 2z - 1]/2 - 6[(1-z)/2]^{3/2} + 2 - 3z/2,$    3.3

with $z = \cos(\Delta(t, t_i))$. The smoothing pseudo spline on the sphere is
thus easily computed by using expressions like 3.3 to obtain
$\eta_{L_m}(t, t_i)$.


4.  Generalized  Cross  Validation 

In applications the smoothing parameter $\lambda$ is unknown. To determine
an estimate of this parameter, Craven and Wahba (1979) and Golub, Heath and
Wahba (1979) have suggested the use of generalized cross validation. To
enhance the understanding of this method a short synopsis of its development
is given.

The method of cross validation (presented here as related to LSS's) is
developed in response to the question: How well may one expect LSS's to
predict the true functional value f(t) at some point t?

Simple cross validation (SCV) suggests predicting the true functional
values of data different from that used in the analysis to assess this
predictive ability. In SCV's simplest form this entails dividing the sample
into two pieces of similar size, using one section for optimization and the
other for testing. In addition, in order to gain more information from the
data, the two pieces may be interchanged and the optimization and testing
performed on each.

SCV is acceptable if there is an ample supply of data so that halving or
doubling the data has little effect on the quality of the estimator. To
lessen this effect Mosteller and Tukey (1968) propose single cross validation
(1CV) (called ordinary cross validation by Wahba (1979)), which is described
suitably by them as follows:

"Suppose that we set aside one individual case, optimize for what is left,
then test on the set-aside case. Repeating this for every case squeezes
the data almost dry. If we have to go through the full optimization
calculation every time, the extra computation may be hard to face.
Occasionally, one can easily calculate, either exactly or to an adequate
approximation, what the effect of dropping a specific and very small part
of the data will be on the optimized result. This adjusted optimized
result can then be compared with the values for the omitted individual.
That is, we make one optimization for all the data, followed
by one repetition per case of a much simpler calculation, a calculation
of the effect of dropping each individual, followed by one test of that
individual. When practical, this approach is attractive."

To describe 1CV mathematically we require some notation. Let
$f_\lambda^{(j)}$ be the solution to the minimization of 2.8 with the jth
point removed from the analysis. Similarly, $D_\sigma^{(j)}$ is the N-1 by N-1
matrix composed of $D_\sigma$ with its j-th row and column removed. To "test
on the set aside case" we require that
$[(f_\lambda^{(j)}(t_j) - z_j)/\sigma_j]$ be small. "Repeating this for every
case" and averaging to yield an overall test gives

    $V_m^0(\lambda) = (1/N) \sum_{j=1}^{N} \left[ (f_\lambda^{(j)}(t_j) - z_j) / \sigma_j \right]^2.$    4.1

1CV uses the $\lambda$ which minimizes $V_m^0(\lambda)$, Wahba and Wold (1975).

To minimize $V_m^0(\lambda)$ directly is not a trivial computational matter.
For each proposed value of $\lambda$ a system of the form 2.12 and 2.13 (of
order N+M-1 instead of N+M) must be solved for each of the N values left out
of the analysis. This entails solving a linear system of order N+M-1 N times!
As noted earlier, "if we have to go through the full optimization calculation
every time, the extra computation may be hard to face." Following the idea of
Mosteller and Tukey we seek a computational simplification for the minimizer
of $V_m^0(\lambda)$.

The simplified form for 1CV was first noted by Craven and Wahba (1979) and
Golub, Heath and Wahba (1979), and given in a slightly more general form in
Wahba and Wendelberger (1980). The 1CV function may be written

    $V_m^0(\lambda) = (1/N) \sum_{j=1}^{N} \left[ (f_\lambda(t_j) - z_j) / (\sigma_j (1 - a_{jj}(\lambda))) \right]^2.$    4.2

$a_{jj}(\lambda)$ is the jth diagonal element of $A_m(\lambda)$ which is
defined by

    $A_m(\lambda) z = (g(t_1), \ldots, g(t_N))^T$

where $g$ is the solution of 2.8. $A_m(\lambda)$ may be thought of as mapping
the vector z into the smoothed values.

In this form "we make one optimization for all the data" by calculating $g$,
which is then "followed by one repetition per case of a much simpler
calculation, a calculation of the effect of dropping each individual." Here we
find $a_{jj}(\lambda)$ and use 4.2.

Evaluation of this formulation of $V_m^0(\lambda)$ involves solving a linear
system of size N+M to find $g$ and one of size N to find $a_{jj}(\lambda)$.
This is a considerable improvement over using 4.1 directly. Because of a
mathematical simplification the amount of computation needed to minimize
$V_m^0(\lambda)$ can be substantially reduced. From a practical point of view
this makes the use of cross validation very attractive.
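
The agreement between the brute-force form 4.1 and the shortcut 4.2 can be checked numerically on any linear smoother that shares the leave-one-out property. The sketch below uses ridge regression as a stand-in smoother, which is an assumption made purely for brevity and is not the spline computation described here; the function names and the unit weights ($\sigma_j = 1$) are likewise illustrative.

```python
import numpy as np


def loo_residuals_brute(X, y, lam):
    """Refit with each case left out, then predict the omitted case (as in 4.1)."""
    n, p = X.shape
    res = np.empty(n)
    for j in range(n):
        keep = np.arange(n) != j
        Xk, yk = X[keep], y[keep]
        beta = np.linalg.solve(Xk.T @ Xk + lam * np.eye(p), Xk.T @ yk)
        res[j] = y[j] - X[j] @ beta
    return res


def loo_residuals_shortcut(X, y, lam):
    """One fit on all the data, then divide each residual by 1 - a_jj (as in 4.2)."""
    p = X.shape[1]
    # A maps the data vector to the fitted values, like A_m(lambda) in the text.
    A = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return (y - A @ y) / (1.0 - np.diag(A))
```

The brute-force version solves N separate systems, while the shortcut solves one and reads off the diagonal of the influence matrix, which is the saving the text describes.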

When applying cross validation to problems other than LSS's, this last
step of finding "what the effect of dropping a specific and very small part of
the data will be on the optimized result" is very important and should not be
overlooked. In fact, this step often makes cross validation computationally
feasible, whereas without this insight it may be impractical.

Finding the minimizer of $V_m^0(\lambda)$ requires evaluation of
$V_m^0(\lambda)$ at different values of $\lambda$ as determined by a search
routine. Hence, although the minimization is possible, we need to repeatedly
solve large linear systems, with the number of solutions required depending on
the search routine employed.

In $V_m^0(\lambda)$ of 4.1 each deviation of $f_\lambda^{(i)}(t_i)$ from the
observed value $z_i$ is treated symmetrically. This choice is arbitrary and is
made for simplicity. A more general approach is to weight each term of 4.1 or
equivalently 4.2 to yield

    $V_m(\lambda) = (1/N) \sum_{i=1}^{N} w_i \left[ (f_\lambda(t_i) - z_i) / (\sigma_i (1 - a_{ii}(\lambda))) \right]^2.$    4.3

Before discussing the choice of these weights, the following definition is
needed.

Definition:

    $R_m(\lambda) = E\,(1/N) \sum_{i=1}^{N} \left[ (f(t_i) - g(t_i)) / \sigma_i \right]^2$

is the expected weighted (by $\sigma_i$) mean squared error between the true
function (f) and the spline (g) evaluated at the independent variables
($t_i$). E denotes mathematical expectation with respect to the error
distribution of the random errors as described in the model of Section 2.

If we want $R_m(\lambda)$ to be small, then the generalized cross validation
value of $\lambda$ should be used as the smoothing parameter value. Using 1CV
as motivation Craven and Wahba (1979) and Golub, Heath and Wahba (1979) have
shown that the $\lambda$ which minimizes $V_m(\lambda)$ with weights

    $w_i = (1 - a_{ii}(\lambda))^2 / \left(1 - N^{-1} \sum_{j=1}^{N} a_{jj}(\lambda)\right)^2$

is an estimate of the $\lambda$ which minimizes $R_m(\lambda)$. Using these
weights in 4.3 gives the generalized cross validation function (GCVF)

    $V_m(\lambda) = (1/N) \sum_{i=1}^{N} \left[ (g(t_i) - z_i) / \left(\sigma_i \left(1 - N^{-1} \sum_{j=1}^{N} a_{jj}(\lambda)\right)\right) \right]^2.$    4.4

The minimizer of 4.4 is called the GCV estimate of $\lambda$.

The GCVF can be rewritten as

    $V_m(\lambda) = (1/N) \| D_\sigma^{-1} (I - A_m(\lambda)) z \|^2 / \left( (1/N)\,\mathrm{Tr}(I - A_m(\lambda)) \right)^2,$    4.5

where Tr is the trace.

Wahba (1981) has proposed

    $\hat{\sigma}_e^2 = \| D_\sigma^{-1} (I - A_m(\lambda)) z \|^2 / \mathrm{Tr}(I - A_m(\lambda))$    4.6

as an estimate of the error variance $\sigma^2$. This leads us to consider
$df_e = \mathrm{Tr}(I - A_m(\lambda))$ as the equivalent degrees of freedom of
error, Wahba (1982). Using these notions we rewrite the GCVF as

    $V_m(\lambda) = N \hat{\sigma}_e^2 / df_e.$    4.7

The method of GCV may be viewed as minimizing the estimated error variance
per error equivalent degrees of freedom.
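
As a rough illustration of 4.5 through 4.7, the sketch below computes the GCV function from an influence matrix in two ways. It assumes the matrix `A` (playing the role of $A_m(\lambda)$) has already been formed by some smoother; the function names are hypothetical and `sigma` holds the known relative standard deviations $\sigma_i$.

```python
import numpy as np


def gcv_score(A, z, sigma):
    """GCV function in the form 4.5."""
    n = len(z)
    resid = (z - A @ z) / sigma      # D_sigma^{-1} (I - A) z
    rss = resid @ resid
    dfe = n - np.trace(A)            # Tr(I - A), equivalent d.f. for error
    return (rss / n) / (dfe / n) ** 2


def gcv_via_variance(A, z, sigma):
    """The same quantity written as in 4.7, using the variance estimate 4.6."""
    n = len(z)
    resid = (z - A @ z) / sigma
    dfe = n - np.trace(A)
    sigma2_hat = (resid @ resid) / dfe   # 4.6
    return n * sigma2_hat / dfe          # 4.7
```

In practice one would evaluate `gcv_score` over a set of candidate $\lambda$ values, rebuilding `A` for each, and keep the minimizer; how the candidates are chosen is left to the search routine.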

5.  Estimation  of  Height,  Wind,  Divergence  and  Vorticity 

In this section we provide a preliminary report of the analysis of some
meteorological data. For a discussion of the analysis of Monte Carlo
experiments using Laplacian smoothing splines see Wendelberger (1981) and
Wahba and Wendelberger (1980). The data to be analysed are obtained from the
irregularly spaced North American radiosonde network during the Ohio storm of
00 Z January 25, 1978.

The height, $h_i$, wind speed and wind direction are reported with
measurement error at the 850, 700, 500, 400, 300, 250, 200, 150 and 100 mb
pressure levels. To analyse the wind, the $u_i$ (east) component and the $v_i$
(north) component are obtained from the wind speed and wind direction
measurements. Using those stations and levels for which all three components
$(h_i, u_i, v_i)$ are obtained yields 112, 117, 116, 116, 113, 114, 109, 108
and 93 observations, respectively, for each pressure level.
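
The conversion from speed and direction to east and north components is not spelled out in the text. The sketch below assumes the usual meteorological convention, in which the reported direction is the compass bearing the wind blows from, measured in degrees clockwise from north; the function name is hypothetical.

```python
import numpy as np


def wind_components(speed, direction_deg):
    """Return (u, v): east and north components of the wind.

    Assumes direction_deg is the bearing the wind blows FROM, in degrees
    clockwise from north (the usual meteorological convention).
    """
    theta = np.deg2rad(direction_deg)
    u = -speed * np.sin(theta)
    v = -speed * np.cos(theta)
    return u, v
```

Under this convention a wind from the west (270 degrees) has a positive u component, and a wind from the north (0 degrees) has a negative v component.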

Using the Laplacian smoothing spline model, the method of sections 2 and 4
with m = 4 provides three fitted surfaces $h_p$, $u_p$ and $v_p$ for each
pressure level p. Figure 1 provides the height field, $h_p$, for p = 850, 500
and 200 mb. The synoptic patterns are in general agreement with the National
Meteorological Center's analysis. Figure 2 gives the isotachs and streamlines
for $u_p$ and $v_p$ with p = 850, 500 and 200 mb. The isotachs are levels of
constant wind speed and the streamlines denote the wind direction.

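The vorticity, V, and horizontal divergence, D, may be obtained from

    $V = \left[ (\partial v / \partial \lambda) / \cos\phi - \partial u / \partial \phi + u \tan\phi \right] / R$    5.1

and

    $D = \left[ (\partial u / \partial \lambda) / \cos\phi + \partial v / \partial \phi - v \tan\phi \right] / R$    5.2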

where R is the radius of the earth, $\phi$ = latitude and $\lambda$ =
longitude. Figure 3 is obtained from 5.1 and 5.2 using $u_p$ and $v_p$ for
p = 850, 500 and 200 mb. The 500 mb vorticity pattern is in excellent
agreement with the National Meteorological Center's analysis, which is
unavailable for comparison at the other levels.

Figure 1: Height fields, X 10 meters. Panels: (1a) 200 mb, (1b) 500 mb, (1c) 850 mb.

Figure 2: Isotachs (m/sec) and streamlines. Panels: (2a) 200 mb isotachs, (2b) 200 mb streamlines, (2e) 850 mb isotachs, (2f) 850 mb streamlines.

Figure 3: Vorticity and divergence, X 10^-5/sec. Panels: (3e) 850 mb vorticity, (3f) 850 mb divergence.

The results presented here will be supplemented with estimates of the
accuracy of these fields by the method presented in Wahba (1981). The
smoothing pseudo splines on the sphere will be employed to obtain h, u, v and
the resulting divergence and vorticity estimates.
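
For readers who want to experiment with vorticity and divergence on gridded winds, the following sketch evaluates the standard spherical-coordinate expressions with finite differences. This is an assumption-laden stand-in: with a fitted spline the derivatives would be taken analytically, whereas here `np.gradient` is used on a regular latitude and longitude grid. The function name and the value of the earth's radius are not from the text.

```python
import numpy as np

EARTH_RADIUS_M = 6.371e6  # mean radius in metres; an assumed value


def vorticity_divergence(u, v, lat, lon, radius=EARTH_RADIUS_M):
    """Relative vorticity and horizontal divergence on a lat/lon grid.

    u, v have shape (len(lat), len(lon)); lat and lon are in radians.
    Derivatives are finite-difference approximations, not spline derivatives.
    """
    phi = lat[:, None]
    du_dphi, du_dlam = np.gradient(u, lat, lon)
    dv_dphi, dv_dlam = np.gradient(v, lat, lon)
    vort = (dv_dlam / np.cos(phi) - du_dphi + u * np.tan(phi)) / radius
    div = (du_dlam / np.cos(phi) + dv_dphi - v * np.tan(phi)) / radius
    return vort, div
```

A simple check is solid-body rotation, u = U cos(latitude) and v = 0, for which the vorticity is 2 U sin(latitude) / R and the divergence vanishes.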


6.  Further  Research 

In  this  section  we  list  some  further  research  ideas. 

The current computational method used for Laplacian smoothing splines
requires a spectral decomposition of an N-M by N-M matrix, Wendelberger
(1981). It seems likely that the calculation of all N-M of the eigenvalues and
eigenvectors is unnecessary. An algorithm which determines how many
eigenvalues are needed would be extremely useful; then a truncation algorithm
could be obtained to compute the spline, Bates and Wahba (1982).

Often, given N observations for which the analysis has been performed, we
may need to update or downdate this set of observations by the inclusion or
exclusion of a single observation. An algorithm which does not require the
spectral decomposition to be performed on the new N-M+1 by N-M+1 or N-M-1 by
N-M-1 matrix would be very valuable. We could then generalize
this to updating and downdating by a small number of points. The usefulness
of this type of algorithm is very apparent in the example provided in section
5.

In remote sensing applications different continuous linear functionals
$L_i f$ will be required. These need to be identified, and fast and accurate
algorithms for computing with them need to be designed. For a specific
example see Nychka (1983).

In remote sensing applications, experiments need to be designed which
will demonstrate the utility of smoothing splines. These will include Monte
Carlo runs with data similar to that obtained in practice and confidence
statements about the estimates obtained.

Methods to check the validity of model assumptions must be devised. A
probability plot of the residuals is one such method; see Wendelberger (1981)
and Wendelberger (1982).

ABSTRACT


Laplacian Smoothing Splines (LSS) are presented as generalizations of
graduation, cubic and thin plate splines. The method of generalized cross
validation (GCV) to choose the smoothing parameter is described. GCV is used
in the algorithm for the computation of LSS's. An outline of a computer
program which implements this algorithm is presented along with a description
of the use of the program. Examples in one, two and three dimensions
demonstrate how to obtain estimates of function values with confidence
intervals and estimates of first and second derivatives. Probability plots
are used as a diagnostic tool to check for model inadequacy.

1. Motivation

A Laplacian smoothing spline (LSS) is a statistical tool used to model a
smooth but otherwise unknown function. The fitted spline provides an analytic
function which may be utilized to estimate derivatives, integrals or values of
the underlying function. For data analysis purposes a graphical display of
the fitted spline (or cross sections for multidimensional problems) often
provides insight which might otherwise remain masked by the irregularly
spaced, multidimensional and "noisy" data. The residuals, which are the
observed values of the dependent variable minus the corresponding fitted
spline values, may be utilized as an aid in model checking. A probability
plot of the residuals provides a vehicle to detect possibly discrepant
observations (outliers). With the above ideas as the eventual objective we
first elucidate the functional form of the LSS and then describe an algorithm
for its computation.

When someone mentions a line, a cosine or an exponential we all have a
visual image or "feel" for the function in question. Using the following
example we hope to provide an intuitive feeling for an LSS.

In one dimension imagine a long, thin, perfectly rigid rod (a line) lying
on a frictionless plane with coordinate axes (t,z). We represent this rod as
a function of t, say g(t). Assume that we are given N points in the plane
$\{(t,z) : (t,z) = (t_i,z_i),\ i = 1, \ldots, N\}$. The $t_i$ are considered to
be distinct and known without error. The $z_i$ are measurements of a true but
unknown function f evaluated at $t_i$ plus some "noise" $\epsilon_i$. The
$\epsilon_i$ are independent random variables, each having mean zero and
finite variance.

With the previous setup imagine that an ideal spring is attached to data
point $(t_i,z_i)$ and to the rod at $(t_i,g(t_i))$ for each i,
$i = 1, \ldots, N$. This fixes the springs to remain parallel to the ordinate
axis. What position will the rod g(t) assume?

Physics provides a means to answer this question. The rod will assume
the position which minimizes the energy of the springs. The energy of an
ideal spring is equal to some positive constant k (called the spring
constant) times the square of the length it is stretched. Thus the cumulative
energy of the N springs is

    $\sum_{i=1}^{N} k_i (z_i - g(t_i))^2.$

This is minimized when g is the least squares line (provided we restrict g to
be rigid); therefore the least squares line is the position the rod will
assume if $k_i = k_0$, $i = 1, \ldots, N$, $k_0$ some constant. If the $k_i$
are not all equal then the rod will assume the position of the weighted least
squares line. Notice that this spring idea provides an intuitive explanation
for minimizing the residual sum of squares in regression.
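
The rigid-rod case can be sketched numerically: with spring constants $k_i$ as weights, the minimum-energy line is the weighted least squares line. The function names below are hypothetical and the normal-equations solve is just one convenient way to obtain the minimizer.

```python
import numpy as np


def rigid_rod_position(t, z, k):
    """Intercept and slope of the line minimizing sum k_i (z_i - a - b t_i)^2."""
    T = np.column_stack([np.ones_like(t), t])
    W = np.diag(k)
    # Weighted normal equations: (T' W T) coef = T' W z.
    return np.linalg.solve(T.T @ W @ T, T.T @ W @ z)


def spring_energy(coef, t, z, k):
    """Cumulative spring energy for the line z = a + b t."""
    a, b = coef
    return np.sum(k * (z - a - b * t) ** 2)
```

Moving the fitted line in any direction can only increase `spring_energy`, which is the minimum-energy statement made above. The same coefficients are returned by `np.polyfit(t, z, 1, w=np.sqrt(k))`, in reversed order.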

The situation is analogous in two dimensions: a thin plate of infinite
rigidity (not bendable) would assume the position of the least squares plane.
The situation in three dimensions, although not as easy to visualize, is
analogous. There are further restrictions on the $t_i$ which are rigorously
given in (2.6).

We have thus far assumed that the rod is rigid. This is not necessary
and may not be a good representation of the physical phenomenon under
consideration. So we relax the rigidity assumption and assume that the rod is
flexible. If zero energy were required to flex the rod then the minimum
energy position which the rod would assume is that of an interpolating
function. Since the residuals are zero, this configuration has zero
energy and thus is a minimum. By this explanation it is readily seen that the
function thus obtained is not unique. This anomaly will be alleviated by
requiring energy to flex the rod.

Consider the more realistic case where the rod is flexible and takes
energy to flex. The spring of a diving board is testimony to this. Note that
the bending energy of a rod is $(\rho/\sigma^2) J_2(g)$, where
$\rho/\sigma^2$ is a constant and

    $J_2(g) = \int_{-\infty}^{\infty} [g^{(2)}(x)]^2 \, dx.$    (1.1)

Therefore the bending energy is proportional to curvature, which may be
measured as $J_2(g)$ in (1.1).

To find the position which the rod will assume under these conditions is
equivalent to finding the function g which will minimize the total energy of
the system

    $\sum_{i=1}^{N} k_i (z_i - g(t_i))^2 + (\rho/\sigma^2) J_2(g)$    (1.2)

or equivalently the minimizer of

    $(1/N) \sum_{i=1}^{N} \sigma^2 k_i (z_i - g(t_i))^2 + (\rho/N) J_2(g).$    (1.3)
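
The step from (1.2) to (1.3) is a rescaling of the objective by the positive constant $\sigma^2/N$, which does not change the minimizer. Written out as a short check:

```latex
\frac{\sigma^2}{N}\left[\sum_{i=1}^{N} k_i\,(z_i - g(t_i))^2
      + \frac{\rho}{\sigma^2}\,J_2(g)\right]
  = \frac{1}{N}\sum_{i=1}^{N} \sigma^2 k_i\,(z_i - g(t_i))^2
      + \frac{\rho}{N}\,J_2(g)
```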

The function from a certain class of functions, X, which minimizes (1.3)
can be shown to be a piecewise cubic spline. The function space X is
rigorously defined in Wahba and Wendelberger (1980). Here X should be thought
of as a space of smooth functions which map $R^d$ into $R^1$. There is much
literature about cubic splines in one dimension. To this author's knowledge
the earliest work on LSS's is that of Schoenberg (1964); other important work
on splines is given in Craven and Wahba (1979), Duchon (1976), Prenter (1975),
and Reinsch (1967).

The one dimensional case generalizes to two dimensions. In two
dimensions the splines are called thin plate splines because of the analogy of
minimizing the energy of a thin plate of infinite extent. The earliest
suggested application of thin plate smoothing splines seems to have been by
Harder and Desmarais (1972). They suggested that spring forces may be applied
at the points of interpolation. This inspired the spring analogy given here.
This spring concept is equivalent to LSS's in either one or two dimensions
(with m = 2 in (2.1)). Much recent work on LSS's has been done by Wahba (see
Wahba (1979) and the references cited there).

In two dimensions $J_2(g)$ becomes

    $J_2(g) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \sum_{\nu=0}^{2} \binom{2}{\nu} \left[ \frac{\partial^2 g(x_1, x_2)}{\partial x_1^{\nu} \, \partial x_2^{2-\nu}} \right]^2 dx_1 \, dx_2.$    (1.4)

$J_2(g)$ is proportional to the bending energy of a thin plate (under
simplifying assumptions); for details see Meinguet (1979). However, in two
dimensions the solution is no longer a piecewise cubic but rather takes the
form

    $g(t) = \sum_{i=1}^{N} c_i' \tau_i^2 \ln(\tau_i) + d_0 + d_1 x_1 + d_2 x_2,$    (1.5)

where $\tau_i$ is the Euclidean distance between $t$ and $t_i$, that is
$\tau_i^2 = |t - t_i|^2 = (t_{i1} - x_1)^2 + (t_{i2} - x_2)^2$; $t_{ij}$ is the
jth component of $t_i$, j = 1, 2; $t = (x_1, x_2)$; $c_i'$ and $d_\nu$ are
constants, $i = 1, \ldots, N$, $\nu = 0, 1, 2$.

To aid in understanding (1.5) the function $\tau_0^2 \ln(\tau_0)$ is plotted
in Figure 1.1 for $t_0 = (0,0)$ and $x_2 = 0$. Rotation of this function
around the ordinate axis and centering at the point $t_i$ will produce the
radially symmetric function $\tau_i^2 \ln(\tau_i)$. Using (1.5) an LSS is seen
to be composed of a linear combination of these radially symmetric functions
plus a plane. The plane has zero bending energy but generally does have
nonzero spring energy. Linear combinations of the radially symmetric functions
can be forced to interpolate the points and hence may have zero spring energy
but generally have nonzero bending energy. This tradeoff between bending and
spring energy, or smoothness and infidelity to the data (terminology of Wahba
(1979)), leads one to consider the minimization problem of Section 2 as a
generalization of these ideas. The one and two dimensional examples with
m = 2 are special cases of this generalization.

We see that the motivation for one and two dimensional LSS's is quite
simple (at least for m = 2). Attach springs to the data points, constrain them
to lie perpendicular to the independent variable space $R^d$, then let the
curve or surface conform by simple bending to the minimum energy
configuration.

The Laplacian smoothing spline was suggested by Duchon (1976) as a
multidimensional generalization of the thin plate (or "plaques minces"), d = 2,
interpolating spline. An LSS is also a multivariate generalization of the one
dimensional, d = 1, "graduation" spline of Schoenberg (1964). Furthermore, the
"graduation" spline is a generalization of the familiar cubic smoothing
spline. The terminology "Laplacian smoothing spline" was suggested by
Professor I. J. Schoenberg. An explanation for using the term "Laplacian" is
given in Wahba (1979).

2. Characterization

Let $z_i = f(t_i) + \epsilon_i$, $i = 1, \ldots, N$. The $t_i \in R^d$ are
known exactly. We assume that the function f is smooth but otherwise unknown.
By smooth it is meant that the function is well approximated by a function
$g \in X$; X is rigorously defined in Wahba and Wendelberger (1980). X may be
thought of as a space of functions which approximate well a large class of
functions of which f is a member. The $\epsilon_i$ are independent, zero mean
and finite variance random variables with variance-covariance matrix
$\sigma^2 D_\sigma^2 = \sigma^2 \mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2)$.
Here $\sigma^2$ is an unknown constant. For example, if we know that all the
variances are equal then we may take
$1.0 = \sigma_1^2 = \cdots = \sigma_N^2$ in what follows. The $\sigma_i^2$ used
here are inversely proportional to the $k_i$ of Section 1, that is,
$k_i = (\sigma \sigma_i)^{-2}$. The $\sigma_i^2$ may be thought of as relative
weights of the measurement errors $\epsilon_i$. The $z_i$ are observed
dependent variables in $R^1$ and the corresponding $t_i$ are independent
variables in $R^d$, $i = 1, \ldots, N$.

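A Laplacian smoothing spline is the function g which is the solution to
the following problem.

Find $g \in X$, X a suitable function space, such that

    $N^{-1} \| D_\sigma^{-1} (z - g) \|^2 + (\rho/N) J_m(g)$    (2.1)

attains its minimum. Here define

    $z = (z_1, \ldots, z_N)^T$, $g = (g_1, \ldots, g_N)^T$, $g_i = g(t_i)$,
    $\| D_\sigma^{-1} (z - g) \|^2 = (z - g)^T D_\sigma^{-2} (z - g)$,
    $D_\sigma^{-1} = \mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_N^{-1})$,

where superscript T means transpose throughout. Also,

    $J_m(g) = \sum_{\nu=1}^{M'} \frac{m!}{\alpha_{1,\nu}! \cdots \alpha_{d,\nu}!} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left[ \frac{\partial^m g(t)}{\partial x_1^{\alpha_{1,\nu}} \cdots \partial x_d^{\alpha_{d,\nu}}} \right]^2 dx_1 \cdots dx_d;$    (2.2)

$t = (x_1, \ldots, x_d)^T$; $M' = \binom{m+d-1}{d-1}$; the
$\alpha_{1,\nu}, \ldots, \alpha_{d,\nu}$ are the $M'$ unique combinations of
$\{0, 1, \ldots, m\}$ such that $\alpha_{1,\nu} + \cdots + \alpha_{d,\nu} = m$.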

In the case presented earlier with d = 2 and m = 2 we have $M' = 3$ and
$(\alpha_{1,\nu}, \alpha_{2,\nu})$ takes on the $M'$ unique values (1,1), (2,0)
and (0,2). In this case (2.2) reduces to (1.4).
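
The index sets and multinomial coefficients appearing in (2.2) are easy to enumerate. The sketch below, with a hypothetical function name, lists every $(\alpha_1, \ldots, \alpha_d)$ summing to m together with its coefficient $m!/(\alpha_1! \cdots \alpha_d!)$.

```python
from itertools import product
from math import comb, factorial


def penalty_terms(m, d):
    """Return [(alpha, coefficient)] for the terms of J_m in d dimensions."""
    terms = []
    for alpha in product(range(m + 1), repeat=d):
        if sum(alpha) == m:
            denom = 1
            for a in alpha:
                denom *= factorial(a)
            terms.append((alpha, factorial(m) // denom))
    return terms
```

For d = 2 and m = 2 this yields the three index pairs named above with coefficients 1, 2 and 1, which are the binomial weights in (1.4), and in general the number of terms equals `comb(m + d - 1, d - 1)`.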

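The solution to the minimization problem is unique and given in (2.3):

$$g_\lambda(t) = \sum_{i=1}^{N} c_i\,\theta_{m,d}\,\tau_i^{2m-d}(\ln\tau_i)^{I_e(d)} + \sum_{\nu=1}^{M} d_\nu\,\phi_\nu(t), \qquad (2.3)$$

where $I_e$ is the indicator function of even integers, that is, $I_e(d)=1$ for $d$ even and $I_e(d)=0$ for $d$ odd;

$$\theta_{m,d} = \begin{cases} (-1)^{d/2+1+m} \big/ \left(2^{2m-1}\pi^{d/2}(m-1)!\,(m-d/2)!\right), & d \text{ even} \\ \Gamma(d/2-m) \big/ \left(2^{2m}\pi^{d/2}(m-1)!\right), & d \text{ odd} \end{cases} \qquad (2.4)$$

and the $\phi_\nu$ are the polynomials of total degree less than $m$,

$$\phi_\nu(t) = \phi_\nu(x_1,\ldots,x_d) = x_1^{p_{1\nu}}\cdots x_d^{p_{d\nu}}. \qquad (2.5)$$

Here the $\phi_\nu$ are unique; $p_{i\nu} \ge 0$, $i=1,\ldots,d$, and $p_{1\nu}+\cdots+p_{d\nu} < m$, $\nu=1,\ldots,M$; $M = \binom{m+d-1}{d}$. Define the $M$ by $d$ matrix $P$ to have elements $p_{i\nu}$. This requires that $2m-d > 0$ and that (2.6) holds. Also,

$$\sum_{\nu=1}^{M} a_\nu\,\phi_\nu(t_i) = 0,\ i=1,\ldots,N \quad\text{implies}\quad a_\nu = 0,\ \nu=1,\ldots,M. \qquad (2.6)$$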

(Condition (2.6) requires that the matrix $T_\sigma$ of Section 5 step (ii) be of rank $M$.) $c = (c_1,\ldots,c_N)^T$ and $d = (d_1,\ldots,d_M)^T$ are obtained by solving the linear system

$$(K + \rho\sigma^2 D_\sigma^2)\,c + T d = z \qquad (2.7)$$

and

$$T^T c = 0. \qquad (2.8)$$

In (2.7) $K$ is the $N$ by $N$ matrix with $ij$th element $\theta_{m,d}\,\tau_{ij}^{2m-d}(\ln\tau_{ij})^{I_e(d)}$. In (2.7) and (2.8) $T$ is the $N$ by $M$ matrix with $i\nu$th element $\phi_\nu(t_i)$. In (2.7) $D_\sigma^2$ is the $N$ by $N$ diagonal matrix with $ii$th entry $\sigma_i^2$. $\sigma^2$ is an unknown proportionality constant which along with $\rho$ is absorbed into $\lambda$ using $N\lambda = \rho\sigma^2$ to yield (2.9) from (2.7).

$$(K + N\lambda D_\sigma^2)\,c + T d = z \qquad (2.9)$$
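As an illustration of (2.4), (2.8) and (2.9), the sketch below assembles $K$ and $T$ and solves the bordered system directly. It is not the algorithm of Section 5. It assumes that $\tau_{ij}$ is the Euclidean distance between $t_i$ and $t_j$ (the usual thin-plate convention; the definition is not repeated in this section), and the names `theta`, `kernel_matrix`, `poly_matrix` and `solve_lss` are hypothetical.

```python
import math
from itertools import product

import numpy as np


def theta(m, d):
    """The constant theta_{m,d} of (2.4)."""
    if d % 2 == 0:
        h = d // 2
        return (-1) ** (h + 1 + m) / (
            2 ** (2 * m - 1) * math.pi ** h
            * math.factorial(m - 1) * math.factorial(m - h))
    return math.gamma(d / 2 - m) / (
        2 ** (2 * m) * math.pi ** (d / 2) * math.factorial(m - 1))


def kernel_matrix(points, m):
    """K_ij = theta * tau_ij^(2m-d) * (ln tau_ij)^{I_e(d)}, tau = distance (assumed)."""
    d = points.shape[1]
    tau = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    values = tau ** (2 * m - d)
    if d % 2 == 0:
        # tau^(2m-d) ln(tau) -> 0 as tau -> 0, so ln is evaluated at 1 there.
        values = values * np.log(np.where(tau > 0.0, tau, 1.0))
    return theta(m, d) * values


def poly_matrix(points, m):
    """Columns are the monomials of total degree < m; also returns exponents."""
    d = points.shape[1]
    exps = [p for p in product(range(m), repeat=d) if sum(p) < m]
    cols = [np.prod(points ** np.array(p), axis=1) for p in exps]
    return np.column_stack(cols), exps


def solve_lss(points, z, m, lam, sigma=None):
    """Solve (K + N lam D^2) c + T d = z together with T^T c = 0."""
    n = len(z)
    sigma = np.ones(n) if sigma is None else np.asarray(sigma, dtype=float)
    K = kernel_matrix(points, m)
    T, _ = poly_matrix(points, m)
    M = T.shape[1]
    block = np.block([[K + n * lam * np.diag(sigma ** 2), T],
                      [T.T, np.zeros((M, M))]])
    sol = np.linalg.solve(block, np.concatenate([z, np.zeros(M)]))
    return sol[:n], sol[n:]
```

A quick sanity check under these assumptions: data lying exactly on a plane incur no roughness penalty when $m=2$, so the solve should return $c \approx 0$ and the plane's coefficients in $d$ for any positive $\lambda$. The dense solve costs on the order of $(N+M)^3$ operations for each $\lambda$.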

The approach of Harder and Desmarais (1972) provides us with a physical interpretation of the parameters, at least in the $d=2$ case. $\rho = N\lambda\sigma^{-2}$ is the plate "rigidity" which is a constant. The value of $\rho$ depends on the material and the thickness of the plate. The spring constant $k_j$ is equal to the reciprocal of the variance, or $(\sigma\sigma_j)^{-2}$. The "load" at the point $t_j$ is $P_j = \rho c_j = (\sigma\sigma_j)^{-2} r_j = k_j r_j$, where $r_j$ is the unnormalized or unscaled residual at that point; i.e., $r_j = z_j - g(t_j)$, $j=1,\ldots,N$, or $r = z - Kc - Td$.

For a discussion of a more general problem and the derivation of the solution the reader is referred to Wahba and Wendelberger (1980). We note here that if the $\epsilon_i$ are not independent but instead have positive definite covariance matrix proportional to $\Sigma$ then $D_\sigma^2$ and $D_\sigma^{-1}$ are everywhere replaced by $\Sigma$ and the symmetric inverse square root $\Sigma^{-1/2}$ to obtain the solution.

To this point we have assumed knowledge of the smoothness parameter $\lambda$. However, it is generally unknown. Before describing a method to dynamically choose $\lambda$ from the data at hand we provide an example to exhibit its influence on the LSS.

3. Example 1 - Variation of the LSS with $\lambda$, $d=1$.

A company which makes and repairs small computers wants to forecast the number of service engineers that it will require over the next few years. To do this requires, among other things, knowledge of the length of a service call. The length of a call is a function of the number of components within the computer which must be repaired or replaced. The information in Table 3.1 was collected on 24 service calls; the data are from Chatterjee and Price (1977). We would like to fit a spline to the data in order to forecast the length of a service call.

We fit a spline to the data using the algorithm given in Section 5. The smoothness parameter, $\lambda$, is dynamically chosen from the data using the method of generalized cross validation (GCV). By showing the influence of $\lambda$ on the LSS of this example we hope to provide a clearer understanding of the role of GCV in choosing the smoothness parameter. The results of the following sections will be easier to understand with this example in mind. Exactly what the GCV choice of $\lambda$ is will be presented in Section 4.

Figure 3.1 shows a plot of the data and the corresponding spline for five different values of $\lambda$. Because there are only 24 observations, of which only 17 have unique independent variables, we should not be surprised if the GCV estimate of $\lambda$ (to be described in Section 4), which is a large sample result, does not perform well. The confidence intervals are calculated using the method of Wahba (1981); the formula used for their computation is given in Example 2 of Section 6.


TABLE 3.1

EXAMPLE 1 - REPAIR TIMES

Length of Calls    Units Repaired
(Minutes)          (Number)

     23                  1
     29                  2
     49                  3
     64                  4
     74                  4
     87                  5
     96                  6
     97                  6
    109                  7
    119                  8
    149                  9
    145                  9
    154                 10
    166                 10
    162                 11
    174                 11
    180                 12
    176                 12
    179                 14
    193                 16
    193                 17
    195                 18
    198                 18
    205                 20

[Figure 3.1: the data and the corresponding spline for five different values of $\lambda$; vertical axis in minutes.]

Considering the brief explanation of the problem given here, the GCV choice of $\lambda$, as used in Figure 3.1c, seems reasonable to use in predicting the number of minutes spent. The GCV choice of $\lambda$ appears to be the most visually pleasing and consistent with how we would expect the number of minutes spent on a service call to be related to the number of computer components repaired.

4. Generalized Cross Validation

In the example of Section 3 the smoothing parameter $\lambda$ is unknown. To determine an estimate of this parameter Craven and Wahba (1979) and Wahba and Wold (1979) have suggested the use of generalized cross validation. A short synopsis of the development of this method is given to enhance the understanding of it.

The method of cross validation (presented here as related to LSS's) is developed in response to the question: How well may one expect LSS's to predict the true functional value $g(t)$ at some point $t$?

Simple cross validation (SCV) suggests predicting the true functional values of data different from that used in the analysis to assess this predictive ability. In its simplest form this entails dividing the sample into two pieces of similar size, using one section for optimization and the other for testing. In addition to this, in order to gain more information from the data, the two pieces may be interchanged and the optimization and testing performed on each.

SCV is all right if there is an ample supply of data so that halving or doubling it has little effect on the quality of the estimator. To lessen this effect Mosteller and Tukey (1963) propose single cross validation (1CV) (called ordinary cross validation by Wahba (1979)), which is described suitably by them as follows:

"Suppose that we set aside one individual case, optimize for what is left, then test on the set-aside case. Repeating this for every case squeezes the data almost dry. If we have to go through the full optimization calculation every time, the extra computation may be hard to face. Occasionally, one can easily calculate, either exactly or to an adequate approximation, what the effect of dropping a specific and very small part of the data will be on the optimized result. This adjusted optimized result can then be compared with the values for the omitted individual. That is, we make one optimization for all the data, followed by one repetition per case of a much simpler calculation, a calculation of the effect of dropping each individual, followed by one test of that individual. When practical, this approach is attractive."

To describe 1CV mathematically we require some notation. Let $g_\lambda^{(j)}$ be the solution to the minimization of (2.1) with the $j$th point removed from the analysis. Similarly, $D_\sigma^{(j)}$ is the $N-1$ by $N-1$ matrix composed of $D_\sigma$ with its $j$th row and column removed. To "test on the set aside case" we require that $[(g_\lambda^{(j)}(t_j) - z_j)/\sigma_j]^2$ be small. "Repeating this for every case" and averaging to yield an overall test gives

$$V_m^0(\lambda) = (1/N)\sum_{j=1}^{N}\left[(g_\lambda^{(j)}(t_j) - z_j)/\sigma_j\right]^2. \qquad (4.1)$$

1CV uses the $\lambda$ which minimizes $V_m^0(\lambda)$.

To minimize $V_m^0(\lambda)$ directly is not a trivial computational matter. For each proposed value of $\lambda$ a system of the form (2.8) and (2.9) (of order $N+M-1$ instead of $N+M$) must be solved for each of the $N$ values left out of the analysis. This entails solving a linear system of order $N+M-1$ $N$ times! As noted earlier, "if we have to go through the full optimization calculation every time, the extra computation may be hard to face." Following the idea of Mosteller and Tukey we seek a computational simplification for the minimizer of $V_m^0(\lambda)$.

The simplified form for 1CV was first noted by Craven and Wahba (1979) and given in a slightly more general form in Wahba and Wendelberger (1980). The 1CV function may be written

$$V_m^0(\lambda) = (1/N)\sum_{j=1}^{N}\left[(g_\lambda(t_j) - z_j)\big/\left(\sigma_j(1 - a_{jj}(\lambda))\right)\right]^2. \qquad (4.2)$$

$a_{jj}(\lambda)$ is the $j$th diagonal element of $A_m(\lambda)$, which is defined by

$$(g_\lambda(t_1),\ldots,g_\lambda(t_N))^T = A_m(\lambda)\,z,$$

where $g_\lambda$ is the solution of (2.1). $A_m(\lambda)$ may be thought of as mapping the vector $z$ into the smoothed values.
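The saving promised by (4.2) over (4.1) can be seen numerically on a simpler linear smoother. The sketch below uses ridge regression as a stand-in for the LSS (a deliberately swapped-in technique; it is not the spline of Section 2), with all $\sigma_j = 1$. The brute-force routine refits $N$ times as in (4.1); the shortcut uses one fit and the diagonal of the influence matrix as in (4.2). All function names are hypothetical.

```python
import numpy as np


def ridge_influence(X, lam):
    """Influence matrix A with fitted values A @ y for ridge regression."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)


def loo_brute_force(X, y, lam):
    """Analogue of (4.1): refit with each case set aside, then test on it."""
    n, p = X.shape
    errors = np.empty(n)
    for j in range(n):
        keep = np.arange(n) != j
        Xk, yk = X[keep], y[keep]
        beta = np.linalg.solve(Xk.T @ Xk + lam * np.eye(p), Xk.T @ yk)
        errors[j] = X[j] @ beta - y[j]
    return float(np.mean(errors ** 2))


def loo_shortcut(X, y, lam):
    """Analogue of (4.2): one fit, residuals inflated by 1 / (1 - a_jj)."""
    A = ridge_influence(X, lam)
    residuals = A @ y - y
    return float(np.mean((residuals / (1.0 - np.diag(A))) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 4))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)
    print(loo_brute_force(X, y, 0.5), loo_shortcut(X, y, 0.5))
```

For this stand-in the two numbers agree to rounding error, while the shortcut needs a single factorization rather than $N$ of them.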

In this form "we make one optimization for all the data" by calculating $g_\lambda$, then "followed by one repetition per case of a much simpler calculation, a calculation of the effect of dropping each individual." Here we find $a_{jj}(\lambda)$ and use (4.2).

Evaluation of this formulation of $V_m^0(\lambda)$ involves solving a linear system of size $N+M$ to find $g_\lambda$ and one of size $N$ to find $a_{jj}(\lambda)$. This is a considerable improvement over that of using (4.1) directly. Because of a mathematical simplification the amount of computation needed to minimize $V_m^0(\lambda)$ can be substantially reduced. From a practical point of view this makes the use of cross validation very attractive.

When applying cross validation to problems other than LSS's this last step of finding "what the effect of dropping a specific and very small part of the data will be on the optimized result" is very important and should not be overlooked. In fact, this step often makes cross validation computationally feasible whereas without this insight it may be impractical.

Finding the minimizer of $V_m^0(\lambda)$ requires its evaluation at different values of $\lambda$ as determined by a search routine. Hence, although the minimization is possible, we need to repeatedly solve large linear systems, with the number of solution times being a function of the search routine employed.

In $V_m^0(\lambda)$ of (4.1) each deviation of $g_\lambda^{(j)}(t_j)$ from the observed value $z_j$ is treated symmetrically. This choice is arbitrary and is chosen for simplicity. A more general approach is to weight each term of (4.1) or equivalently (4.2) to yield

$$V_m(\lambda) = (1/N)\sum_{i=1}^{N} w_i\left[(g_\lambda(t_i) - z_i)\big/\left(\sigma_i(1 - a_{ii}(\lambda))\right)\right]^2. \qquad (4.3)$$

Before a discussion of the choice of these weights the following definition is needed.

Definition:

$$R_m(\lambda) = E\,(1/N)\sum_{i=1}^{N}\left[(f(t_i) - g_\lambda(t_i))/\sigma_i\right]^2$$

is the expected weighted (by $\sigma_i$) mean squared error between the true function ($f$) and the spline ($g_\lambda$) evaluated at the independent variables ($t_i$). Here $E$ denotes mathematical expectation with respect to the error distribution of the random errors as described in the model of Section 2.

If we want $R_m(\lambda)$ to be small then the generalized cross validation value of $\lambda$ should be used as the smoothing parameter value. Using 1CV as motivation Craven and Wahba (1979) and Golub, Heath and Wahba (1979) have shown that the $\lambda$ which minimizes $V_m(\lambda)$ with weights

$$w_i = (1 - a_{ii}(\lambda))^2 \Big/ \Big(1 - N^{-1}\sum_{j=1}^{N} a_{jj}(\lambda)\Big)^2$$

is an estimate of the $\lambda$ which minimizes $R_m(\lambda)$. Using these weights in (4.3) gives the generalized cross validation function (GCVF)

$$V_m(\lambda) = (1/N)\sum_{i=1}^{N}\Big[(g_\lambda(t_i) - z_i)\Big/\Big(\sigma_i\Big(1 - N^{-1}\sum_{j=1}^{N} a_{jj}(\lambda)\Big)\Big)\Big]^2. \qquad (4.4)$$

The minimizer of (4.4) is called the GCV estimate of $\lambda$.

The GCVF can be rewritten as

$$V_m(\lambda) = (1/N)\,\|D_\sigma^{-1}(I - A_m(\lambda))z\|^2 \Big/ \big((1/N)\,\mathrm{Tr}(I - A_m(\lambda))\big)^2; \qquad (4.5)$$

where Tr is the trace.

Wahba (1981) has proposed

$$\hat{\sigma}_e^2 = \|D_\sigma^{-1}(I - A_m(\lambda))z\|^2 \big/ \mathrm{Tr}(I - A_m(\lambda)) \qquad (4.6)$$

as an estimate of the error variance $\sigma^2$. This leads us to consider $df_e = \mathrm{Tr}(I - A_m(\lambda))$ as the degrees of freedom of error. Using these notions we rewrite the GCVF as

$$V_m(\lambda) = N\hat{\sigma}_e^2/df_e. \qquad (4.7)$$

The method of GCV may be viewed as minimizing the estimated error variance per error degree of freedom. This may further be thought of as a form of parsimonious model selection.

In the next section we see that the computation of $V_m(\lambda)$ is reduced to essentially the singular value (or eigenvalue-eigenvector) decomposition of a symmetric positive definite $N-M$ by $N-M$ matrix ($M$ is usually a small integer). The above decomposition makes it possible to form $V_m(\lambda)$ by simple scalar operations for each value of $\lambda$. Thus we have taken the ideas of Mosteller and Tukey one step further. This algorithm is much simpler than the original analysis at essentially the cost of a one time eigenvalue-eigenvector decomposition; i.e., changing the dependent variable (but not the independent variables) does not necessitate another spectral decomposition. Thus, many data sets which have identical independent variables but different dependent variables may be analyzed quite easily and inexpensively.
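The relations among (4.4), (4.5), (4.6) and (4.7) can be checked for any linear smoother once its influence matrix is available. The sketch below is a minimal illustration; `gcv_pieces` and `gcv_from_4_4` are hypothetical names, and the influence matrix passed in is assumed to be supplied by the caller (for an LSS it would be $A_m(\lambda)$).

```python
import numpy as np


def gcv_pieces(A, z, sigma):
    """Return (V, sigma2_hat, df_error) following (4.5), (4.6) and df_e = Tr(I - A)."""
    n = len(z)
    scaled_residual = (z - A @ z) / sigma            # D_sigma^{-1} (I - A) z
    rss = float(scaled_residual @ scaled_residual)
    df_error = n - float(np.trace(A))                # Tr(I - A)
    sigma2_hat = rss / df_error                      # (4.6)
    gcv = (rss / n) / (df_error / n) ** 2            # (4.5)
    return gcv, sigma2_hat, df_error


def gcv_from_4_4(A, z, sigma):
    """The same function written term by term as in (4.4)."""
    n = len(z)
    mean_leverage = float(np.trace(A)) / n
    terms = ((A @ z - z) / (sigma * (1.0 - mean_leverage))) ** 2
    return float(np.mean(terms))
```

With these definitions $V = N\hat{\sigma}_e^2/df_e$ holds identically, which is the content of (4.7).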

When using GCV with a small sample size we may run into problems. The most frequent small sample problem with GCV is that $\lambda = 0$ or $\lambda = \infty$ is chosen when physical considerations dictate that it should not be. $\lambda = 0$ implies that we are interpolating the dependent variable. This should be done if the true underlying rigidity $\rho$ is zero. $\lambda$ equal to infinity implies that we are fitting a polynomial of degree $m-1$ by least squares. This should be done if either the variance is large (relative to the dependent variable) or if the true underlying rigidity is infinite (i.e., the true model is a polynomial). If it is clear from other considerations that the value of $\lambda$ chosen is not indicative of the actual underlying mechanism then that particular value should not be used and the model assumptions should be checked for violations.

The choice of $m$ can also be made by GCV; see Lucas (1978) and Wahba and Wendelberger (1980).

5. Algorithm

The user must supply $N$ independent variables, $t_i \in R^d$, $i=1,\ldots,N$, and their corresponding dependent variables, $z_i \in R^1$, $i=1,\ldots,N$, to compute the LSS at a point $t \in R^d$. Assume that the model described in Section 2 holds. In particular, assume the independent variables $t_i$ are known without error and the dependent variables $z_i$ consist of the true function value at $t_i$, $f(t_i)$, plus "noise," $\epsilon_i$: $z_i = f(t_i) + \epsilon_i$. The $\epsilon_i$ are independent with finite variance $\sigma^2\sigma_i^2$, $\sigma^2$ an unknown constant.

To produce the coefficients $c$ and $d$ needed to evaluate the spline we solve the linear system of equations

$$(K + N\lambda^* D_\sigma^2)\,c + T d = z$$

and

$$T^T c = 0.$$

In this system $\lambda^*$ is the optimal value of the smoothing parameter $\lambda$ as determined by the generalized cross validation function. If $\lambda^*$ is known then the solution of the above linear system could be accomplished for relatively large values of $N$. However, it is usually unknown and must be calculated in order to solve the system of equations.

The method currently used to determine $\lambda^*$ requires the solution of a symmetric $N-M$ dimensional eigenvalue-eigenvector problem. This is the current computational barrier to solving problems with large numbers of observations.

The algorithm presented in Wahba and Wendelberger (1980) requires the inversion of a matrix of order $M$ and two eigenvalue-eigenvector decompositions of symmetric matrices, one $N$ by $N$ and the other (positive definite) $N-M$ by $N-M$. The algorithm presented here requires the solution of a triangular system of order $M$, the QR-decomposition of an $N$ by $M$ matrix and the singular value (or eigenvalue-eigenvector) decomposition of a symmetric positive definite $N-M$ by $N-M$ matrix. This algorithm is faster and requires fewer operations, primarily because of the replacement of one $N$ by $N$ eigenvalue-eigenvector decomposition by the QR-decomposition of an $N$ by $M$ matrix ($M < N$).

This algorithm provides for replicated points. A replicated point is one for which there is more than one observation of the dependent variable for a particular value of the independent variable. Let the total number of unique (independent variable) points be $N_N$ and define $N_0 = N - M - N_N$. Then the computational algorithm is as follows:

(i) Compute $T_\sigma = D_\sigma^{-1}T$.

(ii) Perform the QR-decomposition, described in Dongarra, et al. (1979), of $T_\sigma$:

$$T_\sigma = (Q_1, Q_2)\,(R^T, 0)^T.$$

(iii) Calculate $B = Q_2^T D_\sigma^{-1} K D_\sigma^{-1} Q_2$.

(iv) Decompose $B = (U_1, U_2)\,D_B'\,(U_1, U_2)^T$, using the singular value decomposition of $B$, as described by Golub and Reinsch (1970), or using the spectral decomposition of $B$ as described by Smith, et al. (1976); where

$D_B'$ = diagonal matrix of the eigenvalues ($b_i$) of $B$, which is of dimension $N-M$ by $N-M$,

$D_B$ = diagonal matrix of the positive eigenvalues ($b_i$) of $B$, which is of dimension $N_N$ by $N_N$,

$U_1$ = the eigenvectors of the positive eigenvalues of $B$, which is of dimension $(N-M)$ by $N_N$, and

$U_2$ = the eigenvectors of the zero eigenvalues of $B$, which is of dimension $(N-M)$ by $N_0$.

(v) Form $w = U_1^T Q_2^T D_\sigma^{-1} z$, $w^T = (w_1,\ldots,w_{N_N})$.

(vi) Obtain $\lambda^*$ as the minimizer of

$$V_m(\lambda) = \begin{cases} N\sum_{i=1}^{N_N}\left[w_i/(b_i/N + \lambda)\right]^2 \Big/ \Big(\sum_{i=1}^{N_N} 1/(b_i/N+\lambda)\Big)^2, & \lambda \ne \infty \text{ and } N-M = N_N \\[1ex] N\Big[z_\sigma^T Q_2 Q_2^T z_\sigma - w^T w + \lambda^2\sum_{i=1}^{N_N}\left(w_i/(b_i/N+\lambda)\right)^2\Big] \Big/ \Big(N-M-N_N+\lambda\sum_{i=1}^{N_N} 1/(b_i/N+\lambda)\Big)^2, & \lambda \ne \infty \text{ and } N-M \ne N_N \\[1ex] N\,z_\sigma^T Q_2 Q_2^T z_\sigma/(N-M)^2, & \lambda = \infty \end{cases} \qquad (5.1)$$

where $z_\sigma = D_\sigma^{-1}z$.

(vii) Calculate

$$c = \begin{cases} D_\sigma^{-1}Q_2U_1D_B^{-1}U_1^TQ_2^Tz_\sigma, & \lambda = 0 \\[1ex] D_\sigma^{-1}Q_2U_1(D_B+N\lambda I)^{-1}U_1^TQ_2^Tz_\sigma, & 0 < \lambda < \infty \text{ and } N-M = N_N \\[1ex] D_\sigma^{-1}Q_2U_1\left[(D_B+N\lambda I)^{-1}-(N\lambda)^{-1}I\right]U_1^TQ_2^Tz_\sigma + (N\lambda)^{-1}D_\sigma^{-1}Q_2Q_2^Tz_\sigma, & 0 < \lambda < \infty \text{ and } N-M \ne N_N \\[1ex] 0, & \lambda = \infty. \end{cases} \qquad (5.2)$$

(viii) Solve the triangular system

$$R\,d = Q_1^T D_\sigma^{-1}(z - Kc)$$

for $d$, $d^T = (d_1,\ldots,d_M)$.
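Steps (i) through (viii) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the author's implementation: it handles only $d=2$, $m=2$ with no replicated points (so every eigenvalue of $B$ is positive and the first case of (5.1) and the second case of (5.2) apply), it takes $\tau_{ij}$ to be Euclidean distance, it searches $\lambda$ over a caller-supplied grid instead of a search routine, and it uses a general solver where a triangular solve would be used in practice. The names `tps_kernel` and `lss_gcv` are hypothetical.

```python
import numpy as np


def tps_kernel(points):
    """K_ij = tau_ij^2 ln(tau_ij) / (8 pi), the d = 2, m = 2 case of K."""
    tau2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=2)
    safe = np.where(tau2 > 0.0, tau2, 1.0)       # ln(1) = 0 covers tau = 0
    return 0.5 * tau2 * np.log(safe) / (8.0 * np.pi)


def lss_gcv(points, z, sigma, lambdas):
    """Return (lambda*, c, d, GCV scores) for d = 2, m = 2, no replicates."""
    n = len(z)
    K = tps_kernel(points)
    T = np.column_stack([np.ones(n), points])    # phi = 1, x1, x2, so M = 3
    M = T.shape[1]
    d_inv = 1.0 / sigma                          # diagonal of D_sigma^{-1}

    # (i)-(ii): T_sigma = D^{-1} T and its full QR-decomposition.
    Q, R_full = np.linalg.qr(d_inv[:, None] * T, mode="complete")
    Q1, Q2, R = Q[:, :M], Q[:, M:], R_full[:M, :]

    # (iii): B = Q2^T D^{-1} K D^{-1} Q2.
    B = Q2.T @ (d_inv[:, None] * K * d_inv[None, :]) @ Q2

    # (iv): spectral decomposition of B (symmetrized against rounding).
    b, U = np.linalg.eigh((B + B.T) / 2.0)

    # (v): w = U^T Q2^T D^{-1} z.
    w = U.T @ (Q2.T @ (d_inv * z))

    # (vi): GCV function, first case of (5.1), by scalar operations only.
    def gcv(lam):
        shrink = 1.0 / (b / n + lam)
        return n * np.sum((w * shrink) ** 2) / np.sum(shrink) ** 2

    scores = np.array([gcv(lam) for lam in lambdas])
    lam_star = float(lambdas[int(np.argmin(scores))])

    # (vii): c = D^{-1} Q2 U (D_B + N lam I)^{-1} w.
    c = d_inv * (Q2 @ (U @ (w / (b + n * lam_star))))

    # (viii): R d = Q1^T D^{-1} (z - K c).
    d = np.linalg.solve(R, Q1.T @ (d_inv * (z - K @ c)))
    return lam_star, c, d, scores
```

Only steps (v) through (viii) involve $z$, so in this sketch the QR and eigenvalue decompositions could be cached and reused for a new data vector on the same points.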

An important aspect of this method is the relatively small cost of reconstructing a new LSS using the identical independent variables while changing only the dependent variables. To see this notice that the bulk of the computational effort is in steps (i) through (iv), which do not require knowledge of the dependent variables. These steps depend upon the independent variables and $D_\sigma$. To construct a second LSS with the same independent variables and identical $D_\sigma$ we need only save the matrices $U_1$, $D_\sigma$, $D_B$, $Q_1$, $Q_2$ and $R$. With these matrices we perform steps (v) through (viii) to produce a spline for another set of dependent variables, say $z'$, with little additional computational effort.

The fact that obtaining another spline from $z'$ is easy requires further consideration. It is made possible because of the necessity to minimize the GCVF. This minimization provides the mechanism to easily calculate $c$ and $d$ in steps (vii) and (viii) of the algorithm. If $\lambda^*$ were somehow known a priori then we could go right ahead and solve the linear system (2.8) and (2.9) at a much lower one-time cost. However, even with $\lambda^*$ known, if we had many new data sets $z'$ then for some number of them it would indeed be easier to do the spectral decomposition once and for all.

Instead of saving $U_1$, $D_\sigma$, $D_B$, $Q_1$, $Q_2$ and $R$ we actually save $Q_2U_1$, $D_\sigma$, $D_B$, $Q_1^TD_\sigma^{-1}K$ and the QR-decomposition of $T_\sigma$ to retrieve $R$, $Q_2Q_2^T$ and $Q_1$. By using these matrices we can perform steps (v) through (viii) quite inexpensively. The QR-decomposition can be stored in the storage which has been allocated for $T_\sigma$ plus $M$ additional storage locations. $Q_1^TD_\sigma^{-1}K$ is retained so that it is unnecessary to reevaluate $K$.

6. Example 2 - Franke's Principal Test Function, $d=2$.

Example 2 is a Monte Carlo experiment to demonstrate the surface ($d=2$) which may be obtained by using an LSS with GCV. The "principal test function" of Franke (1979) is used as the true function $f$. This surface consists of two Gaussian peaks and one Gaussian dip superimposed on a surface sloping towards the first quadrant. The surface is defined by

$$\begin{aligned} f(x,y) = {} & .75\exp\left[-\left((9x-2)^2+(9y-2)^2\right)/4\right] \\ & + .75\exp\left[-\left((9x+1)^2/49+(9y+1)/10\right)\right] \\ & + .50\exp\left[-\left((9x-7)^2+(9y-3)^2\right)/4\right] \\ & - .20\exp\left[-\left((9x-4)^2+(9y-7)^2\right)\right]. \end{aligned}$$

A plot of the surface $f$ is given in Figure 6.1.

The surface is reconstructed from 169 "noisy" observations on the grid

$$G = \left\{t_i \;\Big|\; t_i = \left(\frac{2j-1}{26}, \frac{2k-1}{26}\right),\ i = 13(j-1)+k;\ j,k = 1,\ldots,13\right\}.$$

The "noisy" observations are

$$z_i = f(t_i) + \epsilon_i \quad\text{with}\quad \epsilon_i \sim N(0,\sigma^2),\ i=1,\ldots,169,\ \sigma^2 = (.03)^2.$$

The $\epsilon_i$ are generated by the pseudo random number generator RAENBR at the Madison Academic Computing Center, MACC (1978). The LSS with $m=2$ and the smoothing parameter chosen by GCV is plotted in Figure 6.2. The closeness of fit can be qualitatively seen by overlaying Figure 6.2 on Figure 6.1.
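The test surface, grid and noisy observations of Example 2 are straightforward to reproduce in outline. In the sketch below NumPy's default generator stands in for RAENBR, so the noise values will not match those of the original experiment; the names `franke` and `noisy_grid` and the seed are hypothetical.

```python
import numpy as np


def franke(x, y):
    """Franke's principal test function as defined in the text."""
    return (0.75 * np.exp(-((9 * x - 2) ** 2 + (9 * y - 2) ** 2) / 4)
            + 0.75 * np.exp(-((9 * x + 1) ** 2 / 49 + (9 * y + 1) / 10))
            + 0.50 * np.exp(-((9 * x - 7) ** 2 + (9 * y - 3) ** 2) / 4)
            - 0.20 * np.exp(-((9 * x - 4) ** 2 + (9 * y - 7) ** 2)))


def noisy_grid(sigma=0.03, seed=0):
    """Return (grid, z): the 13 x 13 grid G and observations f + N(0, sigma^2)."""
    j, k = np.meshgrid(np.arange(1, 14), np.arange(1, 14), indexing="ij")
    # Row-major ravel gives zero-based index 13 (j - 1) + (k - 1),
    # which is the ordering i = 13 (j - 1) + k of the text, shifted by one.
    grid = np.column_stack([(2 * j.ravel() - 1) / 26, (2 * k.ravel() - 1) / 26])
    rng = np.random.default_rng(seed)            # stand-in for RAENBR
    z = franke(grid[:, 0], grid[:, 1]) + sigma * rng.normal(size=len(grid))
    return grid, z


if __name__ == "__main__":
    grid, z = noisy_grid()
    print(grid.shape, z.shape)
```

The arrays returned here could be passed to any routine that solves (2.8) and (2.9) with $m=2$ and unit relative weights.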

For this example the calculated $\hat{\sigma}_e^2 = (.026)^2$ (using (4.6)) compares favorably with the true $\sigma^2 = (.03)^2$. Using $\hat{\sigma}_e^2$ to obtain confidence intervals for the true curve at the grid points $G$ as in Wahba (1981) gives the 95% confidence intervals

$$g_{\lambda^*}(t_i) \pm 1.96\,\hat{\sigma}_e\,\sigma_i\,(a_{ii}(\lambda^*))^{1/2}, \quad i=1,\ldots,N.$$

Figure 6.1: Example 2 - Franke's Principal Test Function

Figure 6.2: Plot of the $m = 2$, GCV $\lambda$, spline fit to Franke's Principal Test Function from 169 "noisy" points with $\sigma = .03$

Figure 6.3 gives the cross section along the grid showing the true curve, spline fit, observation and 95% confidence interval at each point for each value of $x_j$, $j=1,\ldots,13$.

The number of 95% confidence intervals which cover the true surface is known because the true surface is known. For this example 162 or 95.9% of the intervals cover the true surface. This is a favorable comparison since the expected number is 161. This example was not chosen because of this agreement but rather was the only one run by prior decision.
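The interval formula and the coverage count above amount to a few array operations. The following is a minimal sketch with hypothetical function names; the fitted values, $\hat{\sigma}_e$, relative weights $\sigma_i$ and diagonal elements $a_{ii}(\lambda^*)$ are assumed to be supplied by the fitting routine.

```python
import numpy as np


def confidence_intervals(fit, sigma_e_hat, sigma_rel, a_diag, z_crit=1.96):
    """Lower and upper limits fit_i -/+ z_crit * sigma_e_hat * sigma_i * sqrt(a_ii)."""
    half_width = z_crit * sigma_e_hat * sigma_rel * np.sqrt(a_diag)
    return fit - half_width, fit + half_width


def coverage_count(truth, lower, upper):
    """Number of intervals that contain the true function value."""
    return int(np.sum((lower <= truth) & (truth <= upper)))


if __name__ == "__main__":
    # Arithmetic quoted in the text: 95% of 169 intervals, and 162 of 169.
    print(round(0.95 * 169), round(100 * 162 / 169, 1))
```

The two printed figures are the expected number of covering intervals and the observed coverage percentage quoted in the text.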

The example given here uses points on a grid only for clarity of display. For other $d=2$ Monte Carlo results see Wahba and Wendelberger (1980). The meteorological example given there uses irregularly spaced points.

Figure 6.3: 13 cross sections of Example 2 - true curve (solid line), spline fit (dashed line)

7. Example 3 - Derivatives and Outliers, $d=3$.

Example 3 is a Monte Carlo experiment with $d=3$ and true function

$$f(x_1,x_2,x_3) = (2\pi)^{-3/2}\exp\left[(x_1^2+4x_2^2+9x_3^2)/(-2)\right].$$

Contours of $f$, $f'$ and $f''$ are given as the solid lines in Figures 7.4, 7.5 and 7.6.

Three hundred points $t_i$, $i=1,\ldots,300$ are taken from a uniform distribution in $R = \{(x_1,x_2,x_3) \mid -2<x_1<2,\ -1<x_2<1,\ -2/3<x_3<2/3\}$. The true function $f$ is evaluated at each of the points $t_i$ and added to a Gaussian pseudo random variable with standard deviation $\sigma=.0025$ to yield observation $z_i$. The peak height of $f$ is approximately .0634. $\sigma$ is roughly 4% of the peak height and therefore these data have a "typical" noise level.

A value of $m=4$ was chosen for this example in order that the second derivative of the spline could be used as an estimate of the second derivative of $f$. If $k$ is the order of the derivative desired then $2m-2k-d$ must be positive. Here $2\times4-2\times2-3 = 1 > 0$ and so the second derivative of the LSS will be a good estimate of the second derivative of $f$; for details see Wahba and Wendelberger (1980).

The estimate $\hat{\sigma}_e$ for this experiment is .0024 which agrees nicely with the true value of .0025.

Contours of the true function and the fitted spline, $g_{\lambda^*}$, are plotted in Figure 7.4 for 4 values of $x_3$. Because of the symmetry of the true surface it was not plotted for negative values of $x_3$. The true function and the fitted spline are close to one another near the center of the region and this closeness degrades as we approach the boundary in each of the three directions.

Figure 7.4: Example 3 - solid line is $f$, dashed line is $g_{\lambda^*}$

The contours of the derivatives of $f$ and $g_{\lambda^*}$ with respect to $x_1$, $x_2$ and $x_3$ are given in Figures 7.5a, 7.5b and 7.5c, respectively. The contours of the second derivatives of $f$ and $g_{\lambda^*}$ with respect to $x_1x_1$, $x_1x_2$, $x_1x_3$, $x_2x_2$, $x_2x_3$ and $x_3x_3$ are given in Figures 7.6a, 7.6b, 7.6c, 7.6d, 7.6e, and 7.6f, respectively. The same qualitative behavior is displayed by these derivatives as by the function, with the degradation occurring relatively more rapidly as the boundary is approached. Figure 7.6f, which is $\partial^2/(\partial x_3\,\partial x_3)$ of $f$ and $g_{\lambda^*}$, displays a particularly good fit near the center of $R$.

LSS's may be utilized to detect outliers in multidimensional noisy data provided that the model of Section 2 is (nearly) appropriate. The model requires that the observations are unbiased, i.e., that $Ez = f$. The errors should be additive and have a known relative error structure, $D_\sigma$. For the purpose of the outlier study here we shall further assume that each error $\epsilon_i$ has a Gaussian distribution.

To what extent the assumption of normality may be relaxed in practice requires further study. The smoothness assumption requires that $f(t)$ is a smooth function of $t$. This rules out "cliff" functions or those with discontinuities. By using a probability plot of the residuals, the example discussed here, which satisfies the above requirements, will be used to demonstrate an outlier detection method.

Data sets with outliers need to be constructed. To accomplish this choose the two points of $t_i$, $i=1,\ldots,300$ which are nearest to and farthest from the origin, which is the center of the data region. These two points are $t_k = (-.056, -.032, -.042)$ and $t_l = (1.985, -.879, -.325)$, respectively. To construct data sets $z_{ks}$, let each element of $z_{ks}$ equal the corresponding element of $z$ except for the $k$th. The $k$th element is set equal to $f(t_k) - s\sigma$, $\sigma = .0025$. Construct $z_{ls}$ analogously except that the $l$th element becomes $f(t_l) + s\sigma$.

With the data sets $z_{ks}$ and $z_{ls}$ the probability plots in Figures 7.7 and 7.8 were obtained with MINITAB, Ryan, Joiner and Ryan (1976). The probability plot is constructed by ordering the residuals $r_i$ from smallest to largest and plotting them against their corresponding normal scores. The $i$th smallest normal score as used by MINITAB is the $(i-3/8)/300.25$ percentage point of the normal or Gaussian distribution. If the error distribution that is postulated in the model is the correct one, then the probability plot should be nearly linear. In the data sets constructed here the error distribution is not correct because the $k$th or $l$th point is biased and contains no random component.
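The normal scores described above can be computed with the Python standard library. The sketch below is a minimal illustration, not MINITAB's code; `normal_scores` and `probability_plot_points` are hypothetical names, and the general denominator $n + 1/4$ is inferred from the value 300.25 quoted for $n = 300$.

```python
from statistics import NormalDist


def normal_scores(n):
    """The i-th score is the (i - 3/8) / (n + 1/4) quantile of N(0, 1), i = 1..n."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i - 3 / 8) / (n + 1 / 4))
            for i in range(1, n + 1)]


def probability_plot_points(residuals):
    """Pair each ordered residual with its normal score: (score, residual)."""
    ordered = sorted(residuals)
    return list(zip(normal_scores(len(ordered)), ordered))


if __name__ == "__main__":
    scores = normal_scores(300)
    print(scores[0], scores[-1])
```

Plotting the returned pairs gives the probability plot; a point far from the line formed by the others is a candidate outlier.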

The numbers in Figures 7.7 and 7.8 indicate how many points are plotted at that spot on the graph. An asterisk indicates one point and a plus sign indicates that more than 9 points are overlapping. In Figures 7.7b, c and d the outlier is identified as the point which is separate from the points which form the line. As the assumption of unbiasedness is more strongly violated it shows up more obviously in the plot.

Figures 7.8a-d demonstrate that this outlier detection scheme is not invincible and should be used in conjunction with other diagnostic checks. The point $t_l$ has very high leverage because it is on the boundary of the data region. In linear regression this is analogous to the points at the extremes of the independent variable range which also have high leverage. Because of this the residual at $t_l$ is not large and does not show up in the probability plots of Figures 7.8a-d. The leverage at $t_l$ is so large that it causes another point, the one in the lower left in Figure 7.8d, to appear as discrepant. The probability plot provides a technique to check model assumptions. However, as demonstrated here, this technique should be used in conjunction with other diagnostic checks and with a good understanding of the pitfalls which may be encountered.

Figure 7.7a: Residuals vs. normal scores for one outlier, $f(t_k) - 0\sigma$, at $t_k$.

Figure 7.7b: Residuals vs. normal scores for one outlier, $f(t_k) - 6\sigma$, at $t_k$.

Figure 7.7c: Residuals vs. normal scores for one outlier, $f(t_k) - 10\sigma$, at $t_k$.

Figure 7.7d: Residuals vs. normal scores for one outlier, $f(t_k) - 20\sigma$, at $t_k$.

Figure 7.8a: Residuals vs. normal scores for one outlier, $f(t_l) + 0\sigma$, at $t_l$.

Figure 7.8b: Residuals vs. normal scores for one outlier, $f(t_l) + 6\sigma$, at $t_l$.

Figure 7.8c: Residuals vs. normal scores for one outlier, $f(t_l) + 10\sigma$, at $t_l$.

Figure 7.8d: Residuals vs. normal scores for one outlier, $f(t_l) + 20\sigma$, at $t_l$.

Another diagnostic check which may be employed here is to plot the
residuals, $r_j$, against the distance from $t_j$ to $t_l$. This is analogous to
plotting the residuals against the independent variable in simple linear
regression. If a nonrandom pattern is observed, such as serial correlation,
then we have evidence that some model assumption is being violated. In
practice, $t_l$ is unknown and hence it may be necessary to do all possible
plots, $l = 1, \ldots, N$.
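A minimal sketch of this check, assuming Euclidean distance between data
sites, is given below. The function name, the sites and the residual values
are hypothetical and serve only to show how the N candidate plots can be
generated.

```python
import math

def residuals_vs_distance(points, residuals, l):
    """Return (distance from t_j to t_l, r_j) pairs, sorted by distance."""
    t_l = points[l]
    pairs = [(math.dist(t_j, t_l), r_j) for t_j, r_j in zip(points, residuals)]
    return sorted(pairs)

# Hypothetical two-dimensional sites and residuals.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
residuals = [0.5, -0.2, 0.1, -0.4]

# One candidate plot for every possible reference point t_l.
all_plots = [residuals_vs_distance(points, residuals, l) for l in range(len(points))]
```

Each entry of `all_plots` holds the coordinates of one plot; a systematic
trend of the residuals with distance in any of them would suggest a violated
model assumption.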

If a scaling $D_\sigma$ had been used then the scaled residuals $D_\sigma^{-1} r$
would be plotted instead of $r$.

The procedure described here is a diagnostic method by which some of the
model assumptions may be checked. Irregularly spaced multidimensional "noisy"
data easily mask outliers. This technique provides a means which may detect
these discrepant observations. It is presented here in the hope that it
becomes a routine method to check for model violations in an analysis which
uses LSS's.

The three dimensional results presented here are new and quite promising.
A quantitative measurement of the goodness of fit of the estimated spline and
its derivatives to the true function is given in Wendelberger (1981). Further
Monte Carlo experiments will be performed in 3 and more dimensions.
8. Running the program

Evaluating an LSS at any point $t \in R^d$ involves the execution of two
computer programs. The first of these, called MAIN, produces the coefficients
of the spline. The second, called EVALUATE, produces the spline,
$g_{N,m,\lambda}(t)$. If $2m-2k-d$ is positive, EVALUATE may also be used to
produce the first ($k=1$) or second ($k=2$) derivative of $g_{N,m,\lambda}$.
Depending upon the particular problem at hand the user specifies different
options to be exercised by the program.

These options will be explained card by card below. Card i will be
abbreviated Ci, and the commands are summarized in Table 8.1, with an example
runstream given in Table 8.3.

C1 is used to specify whether or not the coefficient arrays c and d and
the matrices X and P used to reconstruct the spline are written to unit 13. X
contains the values of the independent variables and P contains the exponents
of the polynomials in (2.5), where P is rigorously defined.

To store the spline in unit 13, C1 should have SS13 in columns 1 through 4.
If EVALUATE is not going to be run then the contents of unit 13 will be
unused. In this case C1 should be DONT.

Someone other than the casual user may require other arrays and matrices
which are also written to unit 13. See subroutine WRT13 in Wendelberger
(1981) for details on the arrays and matrices which are written to unit 13.

C2, to be described in the next paragraph, writes into unit 14. See
subroutines AWRT14 and BWRT14 to determine the specific values which are
written to unit 14.
TABLE 8.1

Input for MAIN

CARD    POSSIBLE VALUES                                                   FORMAT

1       SS13, DONT                                                        A4
2       SM14, UM14, DONT                                                  A4
3       SR15, SP15, VL15, DONT                                            A4
4       MGCV, USEL                                                        A4
4+      ($\lambda$)  (Insert if C4 is USEL.)                              E15.8
5       VARI, STAN, SAME  (Omit if C2 is UM14.)                           A4
6       ($d$, $N$, $m$)  (Omit if C2 is UM14.)                            3I5
7       Format of cards C8+1,...,C8+N.                                    18A4
8+1     ($z_1$, $t_1^T$, $\sigma_1$ or $\sigma_1^2$)                      (See C7)
 .      ($z_i$)  (If C2 is not UM14.)
 .      ($z_i$, $t_i^T$, $\sigma_i$ or $\sigma_i^2$)  (If C5 is STAN or VARI.)
 .      Format is provided on C7.
8+N     ($z_N$, $t_N^T$, $\sigma_N$ or $\sigma_N^2$)
9       YES, NO                                                           A4
C2 provides the ability to store certain matrices in unit 14 by using
SM14 in columns 1 through 4. The storage of these matrices makes it
unnecessary to perform the bulk of the computations if a second analysis is to
be performed. However, only the dependent variables may be changed for such a
subsequent analysis. The relative variances or standard deviations must be
identical to those of the run which used SM14 on C2.

UM14 in the first four columns of C2 provides for use of the matrices
which have previously been stored in unit 14. If the value of C2 is DONT then
the matrices are neither stored nor used.

C3 provides a means to retrieve certain information during the execution
of MAIN and to store this information in unit 15. The first four columns of
C3 must be SR15, SP15, VL15 or DONT. If C3 is SR15 the residuals
$r = (z - g_{N,m,\lambda}(t))$ are stored in unit 15 with the format (G24.18).
If C3 is SP15 the ordinate and abscissa for each point of the plot of the GCVF
as given in the output are stored. First the number ($n$) of pairs is stored
in I5 format, followed by the ordered pairs $(i, \ln(V(10^{ai+b})))$, where $i$
is an index number, $i = 1, \ldots, n$, and ln is the natural logarithm; the
format used is (I3,G24.18). If C3 is VL15 then $b_i/N$, $i = 1, \ldots, N-M$,
with format (G24.18), followed by $w$ with the same format, are stored. If
none of the above are to be stored then C3 should be DONT.

The value MGCV on C4 causes the GCVF to be minimized to determine
$\lambda^*$. If the user wants to supply a value of $\lambda$ then the value of
C4 should be USEL. In that case C4+ is used. C4+ should contain the value of
$\lambda$ in (E15.8) format, to be stored in a single precision variable. If C4
is MGCV then C4+ should not be included in the input stream.

C5 is not used if the value of C2 is UM14. Otherwise C5 is used to input
relative variances or relative standard deviations or neither of these for the
errors of the dependent variable. If the relative variances are to be read
then C5 should be VARI; if the relative standard deviations are read then C5
is STAN; and if neither is read then C5 is SAME. The value SAME is equivalent
to entering all 1's as the relative variances. However, if SAME is used then
the program circumvents both multiplication and division by 1, since
$D_\sigma$ is simply the identity matrix.

C6 is not used if C2 is UM14. Otherwise C6 reads in the number of
independent variables $d$ (= dimension), the number of observations $N$ and
the value of $m$ to be used. The format used is (3I5).

C7 contains the format to be used to read in the data values. The format
should require at most 72 spaces, including the left- and right-most
parentheses.

The data follow in C8+1 through C8+N. The data should be real FORTRAN
variables. Each data line should contain, in order, the dependent variable,
the independent variable(s) and the relative variance or standard deviation if
used. If C2 is UM14 then C8+1 through C8+N should contain only the dependent
variables. They should be given in the identical sequence as the dependent
and independent variable(s) were when C2 had the value SM14.

The last card to be read is C9. It should contain one of the values YES
or NO. If YES then experimental confidence intervals are provided along
with degrees of freedom and an estimate of the variance (Wahba, 1981). If
NO then these values are neither computed nor printed.
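The ordering and omission rules for cards C1 through C9 can be summarized in
a short sketch. The helper below is hypothetical and written in Python purely
for illustration; it is not part of the FORTRAN package, and its name and
arguments are assumptions. It mirrors the two MAIN runs of Table 8.3, reading
the first data format as (F3.0,33X,F4.0), which is itself an assumption.

```python
def build_main_deck(c1, c2, c3, c4, data_format, data_cards, c9,
                    lam=None, c5=None, dims=None):
    """Assemble the card images for MAIN in the order C1, ..., C9.

    Hypothetical helper: it only encodes the ordering and omission rules
    described in the text and is not part of the original FORTRAN package.
    """
    if len(data_format) > 72:
        raise ValueError("the format on C7 may use at most 72 columns")
    deck = [c1, c2, c3, c4]
    if c4 == "USEL":
        # C4+ holds the user-supplied lambda in an E15.8-style field.
        deck.append(f"{lam:15.8E}")
    if c2 != "UM14":
        # C5 and C6 are omitted when stored matrices are reused (UM14).
        d, n, m = dims
        deck.append(c5)
        deck.append(f"{d:5d}{n:5d}{m:5d}")  # 3I5: three 5-column integers
    deck.append(data_format)                 # C7
    deck.extend(data_cards)                  # C8+1, ..., C8+N
    deck.append(c9)                          # C9: YES or NO
    return deck

first_run = build_main_deck("SS13", "SM14", "DONT", "MGCV",
                            "(F3.0,33X,F4.0)", ["@ADD DATA."], "YES",
                            c5="SAME", dims=(1, 24, 2))
second_run = build_main_deck("SS13", "UM14", "DONT", "USEL",
                             "(F3.0)", ["@ADD DATA."], "YES", lam=0.00016)
```

In the second call C5 and C6 are left out because C2 is UM14, and C4+ appears
because C4 is USEL.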


To evaluate the spline ($k = 0$), first derivative ($k = 1$) or second
derivative ($k = 2$) the program EVALUATE is used. Before running EVALUATE
the program MAIN must have been run with C1 writing the coefficients to unit
13 (C1 must have been SS13). EVALUATE will then read the matrices from unit
13 and calculate the spline, its first derivative or its second derivative.
The kth derivative ($k = 1$ or $k = 2$) will be calculated only if $2m-2k-d$
is greater than 0. A description of the input stream for EVALUATE is given in
Table 8.2 with a sample runstream given in Table 8.3.
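The condition on $2m-2k-d$ is easy to check before preparing an EVALUATE run.
The following Python sketch is illustrative only; the function names are
assumptions and do not correspond to routines in the package.

```python
def derivative_available(m, k, d):
    """True if the k-th derivative can be produced: 2m - 2k - d > 0."""
    return 2 * m - 2 * k - d > 0

def allowed_orders(m, d):
    """Derivative orders k in {1, 2} that satisfy the condition."""
    return [k for k in (1, 2) if derivative_available(m, k, d)]
```

For instance, with $d = 1$ and $m = 2$ only the first derivative qualifies,
since $2m-2k-d$ equals 1 for $k = 1$ and $-1$ for $k = 2$.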

C1 contains two integer values in (2I5) format. The first integer, $N'$,
specifies the number of points $t \in R^d$ at which the function is to be
evaluated. The second integer should be one of 0, 1 or 2 depending upon
whether the spline, first or second derivative, respectively, is to be
calculated.

The second card contains the format to be used to read in the $N'$ points.
The format should require at most 72 spaces, including the left- and
right-most parentheses. The independent variables are read line by line in
the same sequence as that which was used to calculate the coefficients.

C3 must be either SV15 or DONT. To store the values in unit 15, C3
should be SV15. This causes the values followed by the corresponding
independent variable(s) to be written to unit 15. If C3 is DONT then the
values are not written to unit 15.

C3+ is used only if C3 is SV15. Then C3+ should have the format which is
to be used to write the calculated value(s) followed by the independent
variable(s) into unit 15. This format may have at most 72 spaces, including
both the left- and right-most parentheses.
TABLE 8.2

Input for EVALUATE

CARD    POSSIBLE VALUES                                   FORMAT

1       ($N'$, $k$)                                       2I5
2       Format to read C4+1,...,C4+N'.                    18A4
3       SV15, DONT                                        A4
3+      Format for 15  (Omit if C3 is DONT.)              18A4
4+1     Independent variable points                       (See C2)
 .      of evaluation, $t$.
 .      Format is provided in C2.
4+N'
TABLE 8.3

Sample Runstreams

Runstream                       Comments

@XQT SMOOTH*SPLINE.MAIN         Implements the MAIN program.
SS13                            Stores the spline coefficients in unit 13.
SM14                            Stores matrices in unit 14.
DONT                            Doesn't store other values.
MGCV                            Minimize the GCVF to determine $\lambda^*$.
SAME                            The relative variances are all the same.
    1   24    2                 One dimension, 24 observations, m=2.
(F3.0,33X,F4.0)                 Format of the input data.
@ADD DATA.                      Inserts data from Table 3.1 in runstream.
YES                             Provide confidence intervals.

@XQT SMOOTH*SPLINE.EVALUATE     Implements the EVALUATE program.
  200    0                      At 200 points evaluate the spline.
(36X,F8.4)                      Format of the independent variables.
SV15                            Store the spline and independent variable
                                values in unit 15.
(2E15.8)                        Format of above.
@ADD PLOTDATA.                  Inserts abscissa points to be used for
                                plotting.

@XQT SMOOTH*SPLINE.MAIN         Implements the MAIN program.
SS13                            Stores the spline coefficients in unit 13.
UM14                            Uses the matrices stored in 14 by MAIN
                                above.
DONT                            Doesn't store other values.
USEL                            Use the following value of $\lambda$.
.00016E00                       Value of $\lambda$ to be used.
(F3.0)                          Format of the dependent variables.
@ADD DATA.                      Inserts data from Table 3.1.
YES                             Provides confidence intervals.

@XQT SMOOTH*SPLINE.EVALUATE     Implements the EVALUATE program.
  200    0                      At 200 points evaluate the spline.
(36X,F8.4)                      Format of the independent variable.
SV15                            Store the spline and independent variable
                                in 15.
(2E15.8)                        Format of above.
@ADD PLOTDATA.                  Inserts abscissa points to be used for
                                plotting.
C4+1 through C4+N' contain the independent variable(s) at which the
function is to be evaluated. These should be in the format given on C2. The
independent variable(s) should be in the same sequence as used to obtain the
coefficients with the program MAIN.

The programs MAIN and EVALUATE are written in ASCII FORTRAN Level 9R1 and
are running on the UNIVAC 1100/80 computer at the University of Wisconsin.
All calculations are performed in double precision.

The subroutines used by the programs MAIN and EVALUATE are named:
AWRT14, BWRT14, CALC, CALD, CALRES, CHECKQ, COLOFK, CONINT, DATAR, DERIV1,
DERIV2, E, ED1, ED2, GETASI, GET8M, GETR, GETRDE, GETTHM, GRAPHV, MAKEB,
MAKETS, MINVL1, MINVL2, MQRDC, PRINT, PRNTLM, RCHECK, READ13, SPLINE, SVDB,
VARDF, VLHELP, VOFL, WHATDO, WRT13, and WRT15. GRAPHV, MINVL1 and MINVL2 are
modeled after similar subroutines of the one dimensional smoothing spline
program written by Fleisher (1979) and running at the Madison Academic
Computing Center (MACC). A description of the program structure is given in
Wendelberger (1981).

The following LINPACK subroutines are also used by the program MAIN:
DAXPY, DCOPY, DDOT, DNRM2, DQRDC, DQRSL, DROT, DROTG, DSCAL, DSVDC, DSWAP and
DTRSL. The code for these routines is not included here. It may be found in
the LINPACK USERS' GUIDE by Dongarra, Bunch, Moler and Stewart (1979). One
modification is made in the LINPACK subroutine DSVDC: the parameter MAXIT is
increased from 30 to 60. This parameter sets the maximum number of iterations
to be performed in the algorithm to determine the singular values and vectors
of B before termination due to nonconvergence. Increasing MAXIT to 60 is
necessary because with large N, say N > 140, 30 iterations may not be enough
for some problems. An example with N = 150 failed because MAXIT = 30 was
too small. However, with MAXIT = 60, example 3 with N = 300 was successfully
run. In fact MAXIT = 60 has proved ample for all examples tried to date.

The version of the program described here uses the singular value
decomposition to obtain the spectral decomposition of B. A new modified
version uses the EISPACK (Smith et al., 1976) routines DTRED2 and DTQL2 to
accomplish this task at a much reduced cost and at no loss in accuracy. This
is because the singular value decomposition does not make use of the symmetry
of B. The EISPACK routines do make use of the symmetry of B and thus the cost
of the decomposition is roughly cut in half.
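To make the role of symmetry concrete, the sketch below computes the
eigenvalues of a small symmetric matrix with the cyclic Jacobi method. This
is a different algorithm from the tridiagonalization and QL iteration
performed by DTRED2 and DTQL2, and it is written in Python rather than
FORTRAN; it is swapped in here only because it is short and, like those
routines, applies to symmetric matrices only. The name `max_sweeps` is
hypothetical and plays a role loosely analogous to MAXIT, and the test
matrix is made up.

```python
import math

def jacobi_eigenvalues(B, tol=1e-12, max_sweeps=60):
    """Eigenvalues of a symmetric matrix B (list of lists), in ascending order."""
    n = len(B)
    A = [row[:] for row in B]  # work on a copy
    for _ in range(max_sweeps):
        off = math.sqrt(sum(A[i][j] ** 2
                            for i in range(n) for j in range(n) if i != j))
        if off < tol:
            return sorted(A[i][i] for i in range(n))
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle chosen to annihilate A[p][q] (smaller root).
                tau = (A[q][q] - A[p][p]) / (2.0 * A[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.hypot(1.0, tau))
                c = 1.0 / math.hypot(1.0, t)
                s = t * c
                for k in range(n):  # update columns p and q
                    akp, akq = A[k][p], A[k][q]
                    A[k][p] = c * akp - s * akq
                    A[k][q] = s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k] = c * apk - s * aqk
                    A[q][k] = s * apk + c * aqk
    raise RuntimeError("no convergence within max_sweeps")

# Hypothetical symmetric test matrix.
B = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
eigenvalues = jacobi_eigenvalues(B)
```

For a symmetric positive semidefinite matrix the eigenvalues obtained this
way coincide with its singular values, which is why a symmetric eigenvalue
routine can stand in for a general singular value decomposition.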
ABSTRACT

Let $\Phi(x, y, p, t)$ be a meteorological field of interest, say, height,
temperature, a component of the wind field, etc. We suppose that data
$\tilde\Phi_i$ concerning the field of the form
$\tilde\Phi_i = L_i\Phi + \epsilon_i$ are given, where each $L_i$ is an arbitrary
continuous linear functional and $\epsilon_i$ is a measurement error. The data
$\tilde\Phi_i$ may be the result of theory, direct measurements, remote soundings
or a combination of these. We develop a new mathematical formalism exploiting
the method of Generalized Cross Validation (GCV), and some recently developed
optimization results, for analyzing this data. The analyzed field
$\Phi_{N,m,\lambda}$ is the solution to the minimization problem: Find $\Phi$ in
a suitable space of functions to minimize

$$\frac{1}{N}\sum_{i=1}^{N} \frac{(L_i\Phi - \tilde\Phi_i)^2}{\sigma_i^2} + \lambda J_m(\Phi), \qquad (1)$$

where

$$J_m(\Phi) = \sum_{\alpha_1+\alpha_2+\alpha_3+\alpha_4=m} \frac{m!}{\alpha_1!\,\alpha_2!\,\alpha_3!\,\alpha_4!} \int\!\!\int\!\!\int\!\!\int \left(\frac{\partial^m\Phi}{\partial x^{\alpha_1}\,\partial y^{\alpha_2}\,\partial p^{\alpha_3}\,\partial t^{\alpha_4}}\right)^2 dx\,dy\,dp\,dt.$$

Functions of $d = 1$, 2 or 3 of the four variables $x, y, p, t$ are also
considered. The approach can be used to analyze temperature fields from
radiosonde-measured temperatures and satellite radiance measurements
simultaneously, to incorporate the geostrophic wind approximation and other
information. In a test of the method (for $d = 2$), simulated 500 mb height
data were obtained at discrete points corresponding to the U.S. radiosonde
network by using an analytic representation of a 500 mb wave and superimposing
realistic random errors. The analytic representation was recovered on a fine
grid with what appear to be impressive results. An explicit representation for
the minimizer of (1) is found and used as the basis for a direct (as opposed
to iterative) numerical algorithm, which is accurate and efficient for $N$
somewhat less than the high speed storage capacity of the computer. The
results extend those of Sasaki and others in several directions. In
particular, no starting guess and no preliminary interpolation of the data is
required, and it is not necessary to solve a boundary value problem or even
assume boundary conditions to obtain a solution. Different types of data can
be combined in a natural way. Prior climatologically estimated covariances
are not used. This method may be thought of as a very general form of
low-pass filter. The parameter $\lambda$ controls the half-power point of the
implied data filter, while $m$ controls the rate of roll-off of the power
spectrum of the analyzed field. From another point of view, $\lambda$ and $m$
play the roles of the most important free parameters in an implicit prior
covariance. The correct choice of the parameter $\lambda$ and to some extent
$m$ is important. These parameters are estimated from the data being analyzed
by the GCV method. This method estimates $\lambda$ and $m$ for which the
implied data filter has maximum internal predictive capability. This
capability is assessed by the GCV method by implicitly leaving out one data
point at a time and determining how well the missing datum can be predicted
from the remaining data. The numerical algorithm given provides for the
efficient calculation of the optimum $\lambda$ and $m$.


1. Introduction

Sasaki (1960) introduced the idea of numerical variational analysis for the
objective analysis of meteorological fields. In the most general form of
variational analysis considered here we seek a function $\Phi(x, y, p, t)$ of
four variables representing a meteorological field of interest, say height,
temperature or a component of the wind field, as a function of ground
projection coordinates $(x, y)$, the vertical coordinate $p$ and time $t$. This
function should be suitably close to the height, temperature or wind field as
measured at a finite set of positions, pressures and times, it should reflect
known behavior of such fields, and it should be "smooth" in some sense.

As an example of known behavior, if we fix $p$ at 500 mb, then $\Phi$ is the
500 mb geopotential height. Letting $\Phi = \Phi(x, y, p_0, t)$, then the sum
of the tendency and horizontal advection

$$\frac{\partial\Phi}{\partial t} + c_x\frac{\partial\Phi}{\partial x} + c_y\frac{\partial\Phi}{\partial y}$$

should be small, where $c_x$ and $c_y$ are the $x$ and $y$ components of the wind
velocity. Sasaki and others have incorporated weak (i.e., approximate) and
strong (i.e., exact) constraints involving the tendency, the advection, the
geostrophic wind, balance, horizontal momentum, adiabatic energy, and the
hydrostatic and continuity equations (Sasaki, 1971; Lewis, 1972; Lewis and
Grayson, 1972; Achtemeier, 1975).

* Research supported by the Office of Naval Research under

Using the sum of the tendency and advection as a weak constraint, Sasaki
(1971) suggests finding $\Phi$ to minimize

(1.1)

where the $\alpha$'s are smoothing parameters to be determined, $\tilde\Phi$ is
the observed height field data, $c_x$ and $c_y$ are the (observed) components
of wind velocity, and $R$ is the spatial and temporal region of interest. The
first term represents the desire that $\Phi$ be close to the data, the second
that the sum of the tendency and horizontal advection is small and the third,
that the function be smooth in $x$, $y$ and $t$.

Since $\tilde\Phi$, $c_x$ and $c_y$ are only measured at a (relatively sparse)
set of irregularly spaced points, Sasaki assumed that the data have been
interpolated to a grid sufficiently fine for numerical analytic purposes.
After some simplifying assumptions, the Euler equation for the minimizer of
(1.1) was obtained by Sasaki (1971) and the minimizer is found to satisfy an
elliptic partial differential equation with some boundary conditions. Various
authors using this and other constraints (see, e.g., Lewis and Grayson, 1972)
have chosen values for the smoothing parameters, and solved the resulting
Euler equations numerically to obtain an objectively analyzed field.

In this paper we develop a general mathematical formalism basically
embodying Sasaki's approach with the following five modifications:

1) It is not necessary to first interpolate the data to a grid to obtain
$\tilde\Phi$; raw data is used directly.

2) The problem of providing or enforcing boundary data is eliminated.

3) The main unknown smoothing parameters are estimated from the data to be
analyzed, rather than from historical data or by guesswork.

4) The method provides a technique whereby raw indirect data, such as
satellite radiance data, can be combined with direct data such as balloon
temperature data in a single analysis procedure. This can be done without
preconverting the radiance data to temperatures.

5) Discretization is the last step rather than the first, so this source of
error does not propagate through the analysis. This can be important (see
Nitta and Hovermale, 1969).

The method to be described avoids the problem of solving partial
differential equations numerically. However, it has its own challenging
numerical problems, which we have been able to solve simply using existing
packages for medium sized (but not large) data sets.

To introduce our general method, we begin with the simplest nontrivial
example. We fix time as well as pressure and suppose that $\Phi = \Phi(x, y)$
is the 500 mb height at $(x, y)$ at time $t = 0$. We ignore the tendency and
advection (second term) in (1.1) and suppose observations
$\tilde\Phi(x_i, y_i) = \tilde\Phi_i$, $i = 1, 2, \ldots, N$, of the 500 mb height
at the $N$ stations with coordinates $(x_i, y_i)$, $i = 1, 2, \ldots, N$, are
given. We want to obtain a function $\Phi$ which is smooth and such that
$\Phi(x_i, y_i) \approx \tilde\Phi_i$, $i = 1, 2, \ldots, N$. Consider the
minimization of

$$N^{-1}\sum_{i=1}^{N}\left[\Phi(x_i, y_i) - \tilde\Phi_i\right]^2 + \lambda J_1(\Phi), \qquad (1.2)$$

where

$$J_1(\Phi) = \int\!\!\int \left[\left(\frac{\partial\Phi}{\partial x}\right)^2 + \left(\frac{\partial\Phi}{\partial y}\right)^2\right] dx\,dy \qquad (1.3)$$

and $\lambda$ is given.

If one attempts to minimize (1.2) by, for example, writing the Euler
equation, one finds that the solution involves a Green's function for the
Laplacian operator $\Delta$,
$\Delta\Phi = \partial^2\Phi/\partial x^2 + \partial^2\Phi/\partial y^2$, and,
unfortunately, this Green's function is not bounded. Sasaki (1971) observes a
similar phenomenon [see paragraph which includes Eq. (32)] but ignores it.
For this and other reasons to be discussed, we seek to find the minimizer (in
a suitable space of functions) of

$$N^{-1}\sum_{i=1}^{N}\left[\Phi(x_i, y_i) - \tilde\Phi_i\right]^2 + \lambda J_m(\Phi), \qquad (1.4)$$
where

$$J_2(\Phi) = \int\!\!\int \left[\left(\frac{\partial^2\Phi}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2\Phi}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2\Phi}{\partial y^2}\right)^2\right] dx\,dy, \qquad (1.5)$$

or, more generally,

$$J_m(\Phi) = \sum_{j=0}^{m}\binom{m}{j}\int\!\!\int \left(\frac{\partial^m\Phi}{\partial x^j\,\partial y^{m-j}}\right)^2 dx\,dy, \quad m = 2, 3, \ldots. \qquad (1.6)$$

If $J_m(\Phi)$ is small, then $\Phi$ will be smooth.

We have deliberately omitted any mention of the domain of integration. If
the domain of integration in (1.5) and (1.6) is taken as a bounded region $R$
then it can be shown that the minimizer of (1.4) satisfies

$$\Delta^m\Phi = 0, \quad (x, y) \ne (x_i, y_i), \quad i = 1, 2, \ldots, N,$$

where $\Delta$ is the Laplacian, i.e.,

$$\Delta\Phi = \frac{\partial^2\Phi}{\partial x^2} + \frac{\partial^2\Phi}{\partial y^2},$$

and it satisfies the natural (Neumann) boundary conditions. This result in a
similar problem appears in Dyn and Wahba (1979). We avoid the necessity of
solving a boundary-value problem by letting the domain of integration be
$-\infty \le x, y \le \infty$. The boundary conditions are shifted to
$\infty$. The solution will be defined for $-\infty < x, y < \infty$. However,
we will only compute it on $R$ and, of course, it will only have meaning if
there are data points not too far from the boundary. We are also assuming
here that the world is flat in $R$, although the entire analysis that we do
here can be done on the sphere [for the theory, see Wahba (1979c)].

The solution, which we call $\Phi_{N,m,\lambda}$, to the problem is as follows:
Find $\Phi$ in a suitable space $X$ to minimize

$$N^{-1}\sum_{i=1}^{N}\left[\Phi(x_i, y_i) - \tilde\Phi_i\right]^2 + \lambda J_m(\Phi). \qquad (1.7)$$

This was obtained by Duchon (1976a) and further studied by Meinguet (1978,
1979) and Wahba (1979a,b). It is known as a "thin plate spline" and is a
natural generalization to two dimensions of the one-dimensional smoothing
polynomial spline (Reinsch, 1967).

We will give an explicit computable formula for $\Phi_{N,m,\lambda}$ later.
Problems in assigning boundary values are eliminated, and no preliminary
analysis of the raw data is used.

$\Phi_{N,m,\lambda}$ may be considered as the result of applying a low-pass
filter to the data. In frequency space it can be shown that $\lambda$ controls
the half-power point of the filter and $m$ the steepness of the roll-off [see
Wagner (1971), Craven and Wahba (1979) and Wahba (1978a)]. In one dimension
the filter function $f(\nu)$ as a function of wavenumber $\nu$ looks like
$f(\nu) = 1/(1 + \lambda\nu^{2m})$. We choose $\lambda$ and $m$ from the data
by the GCV (generalized cross-validation) method (Craven and Wahba, 1979;
Golub et al., 1979), which proceeds as follows. The criterion for a good
choice of $\lambda$ and $m$ is taken to be the ability to predict the value of
the field where data are withheld.

To estimate this predictive ability from the data, let
$\Phi^{[k]}_{N,m,\lambda}$ be the function which is the minimizer of (1.7) with
the kth data point omitted. If $\lambda$ and $m$ are good choices, then on the
average $\Phi^{[k]}_{N,m,\lambda}(x_k, y_k) - \tilde\Phi_k$ should be small and
we measure this by the ordinary cross-validation function

$$V^0_m(\lambda) = N^{-1}\sum_{k=1}^{N}\left[\Phi^{[k]}_{N,m,\lambda}(x_k, y_k) - \tilde\Phi_k\right]^2. \qquad (1.8)$$

This expression is difficult to compute; furthermore, effects of unequal
spacing of data points are not suitably accounted for. For these and other
technical reasons recounted in Craven and Wahba (1979) and Golub et al.
(1979), one should measure the ability of $\Phi_{N,m,\lambda}$ to predict
missing data by the generalized cross-validation function (GCVF)

$$V_m(\lambda) = N^{-1}\sum_{k=1}^{N}\left[\Phi^{[k]}_{N,m,\lambda}(x_k, y_k) - \tilde\Phi_k\right]^2 w_k(m, \lambda), \qquad (1.9)$$

where the $w_k(m, \lambda)$ are certain weights which have been given in Craven
and Wahba (1979) and Golub et al. (1979). $V_m(\lambda)$ turns out to have a
collapsed representation which is relatively easy to compute. For each
$m = 2, 3, 4, \ldots$ up to some preset maximum, $V_m(\lambda)$ is computed as a
function of $\lambda$ and the value $\lambda(m)$ of $\lambda$ minimizing
$V_m(\lambda)$ is determined. Then $m$ is selected by comparing
$V_m(\lambda(m))$ over $m$. A computer implementation of this example has been
made and applied to data simulated from a mathematical model for a 500 mb
height field. The results are presented in Section 4.
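The two-stage selection just described (minimize over $\lambda$ for each $m$,
then compare the minima over $m$) can be sketched in a few lines of Python.
The sketch assumes a spectral form of the GCV function in which the data have
been rotated into a basis where the residual operator is diagonal with
entries $N\lambda/(b_i + N\lambda)$; that form, the eigenvalues `b`, the
rotated data `w`, the $\lambda$ grid and every function name are assumptions
made for illustration and are not taken from the implementation reported
here.

```python
def gcv(lam, b, w):
    """GCV value (1/N)||(I - A)z||^2 / [(1/N) tr(I - A)]^2 in a diagonal basis."""
    n = len(b)
    shrink = [n * lam / (b_i + n * lam) for b_i in b]  # diagonal of I - A(lam)
    rss = sum((s * w_i) ** 2 for s, w_i in zip(shrink, w)) / n
    return rss / (sum(shrink) / n) ** 2

def minimize_over_lambda(b, w, grid):
    """Return (V, lambda) at the grid point with the smallest GCV value."""
    return min((gcv(lam, b, w), lam) for lam in grid)

def select_m(candidates, grid):
    """candidates maps m -> (b, w); returns (m, lambda(m), V) with smallest V."""
    best = None
    for m, (b, w) in sorted(candidates.items()):
        v, lam = minimize_over_lambda(b, w, grid)
        if best is None or v < best[2]:
            best = (m, lam, v)
    return best

# Hypothetical spectra and rotated data for m = 2 and m = 3.
grid = [10.0 ** e for e in range(-8, 3)]
candidates = {
    2: ([10.0, 1.0, 0.1, 0.01], [5.0, 2.0, 0.1, 0.1]),
    3: ([20.0, 0.5, 0.02, 0.001], [5.0, 2.0, 0.1, 0.1]),
}
m_best, lam_best, v_best = select_m(candidates, grid)
```

A grid search is used here only for brevity; any one-dimensional minimizer
could replace it.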
We next generalize this approach to allow the imposition of weak
constraints. Continuing with $p = 500$ mb, $t = 0$, we consider as an example
the geostrophic wind approximation

$$u_g \approx -\frac{1}{f}\frac{\partial\Phi}{\partial y}, \qquad v_g \approx \frac{1}{f}\frac{\partial\Phi}{\partial x},$$

where $\Phi$ is the 500 mb height, $u_g$ and $v_g$ are eastward and northward
components of the geostrophic wind, and $f$ is the Coriolis parameter. If the
eastward and northward components of the wind are measured at each station,
one can seek $\Phi$ to minimize

$$N^{-1}\sum_{i=1}^{n}\sigma_1^{-2}\left[\Phi(x_i, y_i) - \tilde\Phi_i\right]^2 + N^{-1}\sum_{i=1}^{n}\sigma_2^{-2}\left[\frac{1}{f}\frac{\partial\Phi}{\partial y}\bigg|_{(x_i, y_i)} + \tilde u_i\right]^2 + N^{-1}\sum_{i=1}^{n}\sigma_3^{-2}\left[\frac{1}{f}\frac{\partial\Phi}{\partial x}\bigg|_{(x_i, y_i)} - \tilde v_i\right]^2 + \lambda J_m(\Phi), \qquad (1.10)$$

where $N = 3n$, $\tilde\Phi_i$ is the measured 500 mb height and $\tilde u_i$,
$\tilde v_i$ are the observed wind components at station $i$. $\sigma_1^2$ is a
weight which is, ideally, the mean-square error in the measured height field.
$\sigma_2^2$ is the sum of the mean-square error in the measured eastward
component of the wind and the mean-square error in the geostrophic
approximation to the true eastward wind. $\sigma_3^2$ has the corresponding
meaning for the northward component of the wind.

For $m \ge 3$ an explicit formula for the minimizer $\Phi_{N,m,\lambda}$ of
(1.10) will be given.

Since we are going to choose $\lambda$ from the data, it is only necessary
that $\sigma_2^2/\sigma_1^2$ and $\sigma_3^2/\sigma_1^2$ are known reasonably
well. Assuming all mean-square errors are known, it has been suggested by
Reinsch (1967) and others to choose $\lambda$ so that the first three terms in
(1.10), with $\Phi$ replaced by $\Phi_{N,m,\lambda}$, sum to 1. However, it has
been shown (see Wahba, 1975; Craven and Wahba, 1979) that this will lead
systematically to undersmoothing.

The idea of the generalized cross-validation function extends to the choice
of $\lambda$ and $m$ in the minimizer of (1.10) and we can obtain the GCVF
$V_m(\lambda)$, which can be minimized to estimate good values of $m$ and
$\lambda$.

In this example, where $\sigma_1^2$, $\sigma_2^2$ and $\sigma_3^2$ may be
different, the minimizer of the GCVF estimates $\lambda$ and $m$ which best
predict missing data points, inversely weighted by the appropriate
$\sigma_j^2$.
We next turn to the analysis of a temperature field using both direct
(balloon) and remote (satellite radiance) data. We assume that all data are
measured at $t = 0$ and that $\Phi(x, y, p)$ represents the temperature. The
data consist of direct measurement of the temperature from station $i$ at
pressure $p_i$, and indirect satellite measurements of radiances $I_i(\nu)$ at
frequency $\nu$ and subsatellite point $(x_i, y_i)$. In the simplest case
(cloudless, looking down), after some linearization and approximations,$^1$ a
known function $r_i(\nu)$ of the measured radiance $I_i(\nu)$ can be related to
the temperature $\Phi$ by

$$r_i(\nu) = \int K(\nu, p)\,\Phi(x_i, y_i, p)\,dp, \qquad (1.11)$$

1 In  ihi  pvvixs  *'t  luii  in /tin.  ft*  »>bt  tin  tl  III  his|  li 

hiM  h iihiI  Om  iM  i»’'l  tin  this  In vt  iuns  luM  h\  milwini. 
thi  h iIUhui  J ii  1 iloni  b\  li  mil  thi  t uli  titu  i!  il  t in  wh  tl 
folhiv  \ Uim  in  il  l*)| 


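where $K(\nu,p)$ is known for each frequency $\nu = \nu_1, \ldots, \nu_l$ (see Fritz et al., 1972).

Thus we seek to minimize

$$N^{-1}\sum_{i,j}\sigma_1^{-2}\left[\Phi(x_i,y_i,p_j) - \Phi_{ij}\right]^2 + N^{-1}\sum_{i,\nu}\sigma_2^{-2}\left[\int K(\nu,p)\,\Phi(x_i,y_i,p)\,dp - r_i(\nu)\right]^2 + \lambda J_m(\Phi), \qquad (1.12)$$

where $N$ is the total number of observations and

$$J_m(\Phi) = \sum_{\alpha_1+\alpha_2+\alpha_3=m}\frac{m!}{\alpha_1!\,\alpha_2!\,\alpha_3!}\iiint\left(\frac{\partial^m \Phi}{\partial x^{\alpha_1}\,\partial y^{\alpha_2}\,\partial p^{\alpha_3}}\right)^2 dx\,dy\,dp. \qquad (1.13)$$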


We will give an explicit formula for the minimizer $\Phi_{N,m,\lambda}$ of (1.12) and the GCVF $V_m(\lambda)$ for this problem for $m \ge 2$. In theory, there is no difficulty in adding weak temperature constraints, or in carrying out the analysis in three space variables and one time variable with direct data, indirect data and weak constraints. (A finite number of strong constraints can be added, too, and we briefly indicate how.) In practice the method has computational limits. The computation of $\Phi_{N,m,\lambda}$ requires the solution of a linear system of dimension close to the number $N$ of data and "weak constraint" terms. The computation of the GCVF requires the solution of an eigenvalue problem of size $N$. We are obtaining very good results with $N$ up to as large as 140 with present methods on the Univac 1110 at the University of Wisconsin, Madison, but improved algorithms will have to be developed to go beyond this point on this size machine. There is reason to believe that this can be done. Some algorithms handling four times as many points in certain special cases have been developed by Paihua (1978). Other numerical methods suitable for large data sets are suggested in Wahba (1980a,b). $\Phi_{N,m,\lambda}$ is found in terms of coefficients of certain basis functions, so that the bulk of the numerical work is only done once for each set of data. $\Phi_{N,m,\lambda}$, and in certain cases its derivatives, can be evaluated on a fine grid essentially for "free."

We briefly mention the relationship of this work to some other approaches in the literature. Fritsch (1971) discusses a related form of two-dimensional spline objective analysis. Wagner (1971) analyzed some of Sasaki's variational objective analysis methods from the point of view of their properties as low-pass filters and experimented with the parameter which controls the half-power point of the filter (here $\lambda$), with the equivalent of our $m = 2$. The Fields by Information Blending developed by M.



MONTHLY  WEATHER  REVIEW 


Holl and associates also has the capability of blending different types of data. In our notation Holl's approach is to minimize a discrete approximation to the $\Phi$ which minimizes

V - 4\)» 

1*1 

(see Holl (1976), Eq. (5)). The data are assembled on a regular discrete grid and $\Phi$ is computed only on the same grid. Derivatives are replaced by finite differences. Some of the smoothing is effected by the fact that there are more terms in the sum above than there are grid points on which $\Phi$ is to be computed and, in addition, the system of equations to be solved to obtain the minimizer is solved approximately by iterative techniques, where the choice of weighting parameters and number of iterations will have a filtering effect (see also Wahba, 1980a, Section 8).

The discussion would not be complete without noting that, in general, variational objective analysis methods involving a quadratic non-negative definite penalty term like $\lambda J_m(\Phi)$ are intimately related to certain forms of (Gandin) optimum objective analysis methods. We illustrate this remark by a simple discretized example. Consider a vector of variables of interest $x = (x_1, x_2, \ldots, x_n)'$. Suppose $z_i = x_i + \epsilon_i$ is observed for $i = 1, \ldots, n$, where the $\epsilon_i$ are supposed to be zero mean independent Gaussian random variables with variance $\sigma^2$. Suppose that the $x_i$ have a prior Gaussian distribution with $E x_i = 0$ and $E x_i x_j = \sigma_{ij}$, where $E$ is mathematical expectation. Letting $\Sigma$ be the $n \times n$ matrix with $ij$th entry $\sigma_{ij}$, the conditional expectation $\hat{x}$ of $x$ given the data $z = (z_1, \ldots, z_n)'$ is

$$\hat{x} = \Sigma(\Sigma + \sigma^2 I)^{-1} z,$$

where $I$ is the $n \times n$ identity matrix. However, it is also true that $\hat{x}$ given above is the solution to the minimization problem: find $x$ to minimize

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - z_i)^2 + \lambda J(x),$$

where $J(x) = x'\Sigma^{-1}x$ and $\lambda = \sigma^2/n$. Returning to functions $\Phi(x,y)$, for example, there is a prior covariance on $\Phi(x,y)$ such that $\Phi_{N,m,\lambda}$, the minimizer of (1.7), has the property that $\Phi_{N,m,\lambda}(x,y)$ is the conditional expectation of $\Phi(x,y)$ given the data $z_i = \Phi(x_i,y_i) + \epsilon_i$, where the $\epsilon_i$ are independent zero mean Gaussian (error) random variables with common variance $\sigma^2$. The theory behind this remark can be found in Kimeldorf and Wahba (1970, 1971) and Wahba (1978b, 1979c). The choice of $m$ controls the rate of decay of the power spectrum of the signal with wavenumber, equivalently the shape of the low-pass filter in the frequency domain. Thiebaux (1980) discusses the relationship of $m$ to prior covariances in some related but slightly different

Volume 108

examples. Details for low-pass filtering on the sphere by variational methods may be found in Wahba
(I'l/’L.  .billion  4 li 
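The equivalence in the discretized example above, between the conditional expectation $\Sigma(\Sigma + \sigma^2 I)^{-1}z$ and the minimizer of the penalized sum of squares with $\lambda = \sigma^2/n$, can be checked numerically. The following is only an illustrative sketch: the covariance matrix, the data and the value of $\sigma^2$ are made-up assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
sigma2 = 0.5  # assumed error variance sigma^2 (illustrative only)

# Hypothetical prior covariance: any symmetric positive definite matrix will do.
B = rng.standard_normal((n, n))
Sigma = B @ B.T + n * np.eye(n)
z = rng.standard_normal(n)  # synthetic observations

# Conditional expectation of x given z: Sigma (Sigma + sigma^2 I)^{-1} z.
x_bayes = Sigma @ np.linalg.solve(Sigma + sigma2 * np.eye(n), z)

# Minimizer of (1/n) sum (x_i - z_i)^2 + lam * x' Sigma^{-1} x, lam = sigma^2 / n.
# Setting the gradient to zero gives (I + n * lam * Sigma^{-1}) x = z.
lam = sigma2 / n
x_var = np.linalg.solve(np.eye(n) + n * lam * np.linalg.inv(Sigma), z)

print(np.max(np.abs(x_bayes - x_var)))  # difference is at round-off level
```

The two vectors agree because $(I + \sigma^2\Sigma^{-1})^{-1} = \Sigma(\Sigma + \sigma^2 I)^{-1}$.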

In Section 2 we provide the solution to a general minimization problem of which all the previously mentioned problems are special cases. In Section 3 we describe the GCVF, which allows the estimation of $\lambda$ and $m$ from the data being analyzed. In Section 4, results of a Monte Carlo test of the method are given, using realistic simulated 500 mb height data where the "true" field is known. Numerical methods used are somewhat nonstandard and are described in some detail in the Appendices.

Analysis of the height field via minimization of (1.4) is an isotropic method. Thiebaux (1977) has provided some evidence that an improved analysis may be obtained using methods which have different north-south and east-west scales. This feature may be incorporated here by making a change of scale $x \to kx$ and $y \to k^{-1}y$. A good scale parameter $k$ may be estimated by GCV simultaneously with $\lambda$ and $m$. Some very preliminary numerical results with actual reported 500 mb height data from the U.S. rawinsonde network suggest that the $k = 1$ (i.e., isotropic) analysis can be improved upon by estimating $k$ (see Wendelberger, 1981). We do not discuss anisotropic methods further here.

Kreiss (1979a,b) notes that for successful numerical solution of certain differential equations related to numerical weather forecasting, it is desirable to have initial conditions that have certain continuity properties. We conjecture that the methods suggested here can be used to provide these initial conditions.

2.  Solution  of  a general  minimization  problem. 

In this section we give a solution to a general minimization problem of which the minimization problems of Eqs. (1.7), (1.10) and (1.12) are special cases.

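Our results hold in any number of dimensions, where most meteorological problems of interest will involve $d = 2$, 3 or 4. The $d = 1$ case results in the familiar polynomial smoothing spline (see Reinsch, 1967). We will say a function $u$ of $d$ variables $x_1, x_2, \ldots, x_d$ is "smooth" if $J_m(u)$ defined by

$$J_m(u) = \sum_{\alpha_1+\cdots+\alpha_d=m}\frac{m!}{\alpha_1!\cdots\alpha_d!}\int\cdots\int\left(\frac{\partial^m u}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}\right)^2 dx_1\cdots dx_d$$

is small.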

We seek to find a $u$ which is simultaneously compatible with the data $z_1, \ldots, z_N$ and is appropriately smooth. The data are assumed to be

$$z_i = L_i u + \epsilon_i,$$

where the $L_i$ are (any) continuous linear functionals of $u$ and the $\epsilon_i$ are measurement errors. A rigorous definition of a continuous linear functional as being used here is given in Appendix A, but we note the most useful ones here. Let $t^* = (x_1^*, \ldots, x_d^*)$ be a fixed point in $d$ dimensions. Then

$$Lu = u(t^*)$$

is a continuous linear functional for each fixed $t^*$ provided

$$2m - d > 0,$$

and

$$Lu = \left.\frac{\partial^k u}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}\right|_{t^*}$$

with $\alpha_1 + \cdots + \alpha_d = k$ is also a continuous linear functional for each fixed $t^*$, provided

$$2m - 2k - d > 0.$$

This allows incorporation of the winds as estimates of the gradient of the pressure field via the geostrophic approximation. $L$ of the form

$$Lu = \int\cdots\int_{\Omega} K(x_1, \ldots, x_d)\,u(x_1, \ldots, x_d)\,dx_1\cdots dx_d$$

is a continuous linear functional if, for example, $\Omega$ is a bounded set and

$$\int\cdots\int_{\Omega}\left|K(x_1, \ldots, x_d)\right|\,dx_1\cdots dx_d < \infty.$$

This allows merging of radiance data with direct temperature data in the objective analysis of temperature fields. We remark that $Lu = u(t^*)$ is not a continuous linear functional if $m = 1$, $d = 2$, and this leads to the difficulties mentioned previously in regard to the minimization of (1.2).

We suppose that the $\epsilon_i$ are independent zero mean errors with $E\epsilon_i^2 = \sigma_i^2$. We seek to find $u$ in a suitable (Hilbert) space of functions (defined in Appendix A) to minimize

$$N^{-1}\sum_{i=1}^{N}(L_i u - z_i)^2\,\sigma_i^{-2} + \lambda J_m(u). \qquad (2.1)$$


In this section $\lambda$ and $m$ are fixed. In Section 3 we show how to choose $\lambda$ and $m$. We will give an explicit formula for the $u$ which minimizes (2.1) for general $L_i$. The special cases (1.7), (1.10) and (1.12) and others of interest can then be deduced. Computational algorithms are discussed in the Appendices and a numerical test of the method on simulated 500 mb height data is given in Section 4.

The minimizer, call it $u_{N,m,\lambda}$, of (2.1) is expressible in terms of polynomials of total degree less than $m$ and the fundamental solutions of the iterated Laplacian. Before stating the result, we define some notation. In $d$-dimensional space there are

$$M = \binom{d+m-1}{d}$$

polynomials of total degree less than or equal to $m - 1$. We let $\{\phi_\nu\}_{\nu=1}^{M}$ be these $M$ polynomials. For example, if $d = 2$, $m = 3$, then $M = 6$ and

$$\phi_1(x_1,x_2) = 1,\quad \phi_2(x_1,x_2) = x_1,\quad \phi_3(x_1,x_2) = x_2,$$
$$\phi_4(x_1,x_2) = x_1^2,\quad \phi_5(x_1,x_2) = x_1 x_2,\quad \phi_6(x_1,x_2) = x_2^2. \qquad (2.2)$$

Observe that $J_m(\phi_\nu) = 0$, $\nu = 1, 2, \ldots, M$, so that polynomials of total degree $\le m - 1$ are considered infinitely smooth by this method. We define the Laplacian $\Delta$ by

$$\Delta u = \sum_{i=1}^{d}\frac{\partial^2 u}{\partial x_i^2}.$$

If $u$ and all its derivatives up to order $m - 1$ are continuous and are zero at infinity, then by integration by parts, one has

$$J_m(u) = (-1)^m\int\cdots\int u\,\Delta^m u\;dx_1\cdots dx_d.$$

Thus, the iterated Laplacian $\Delta^m$ would play a role in an Euler equation approach for the solution of the variational problem (2.1), although we do not use that method to obtain the solution.
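As a small illustration of the count $M = \binom{d+m-1}{d}$ and of the ordering used in (2.2), the monomials of total degree at most $m - 1$ can be enumerated by their exponent tuples. This is a sketch only; the function names are hypothetical and not part of the paper.

```python
from itertools import product
from math import comb


def monomial_exponents(d, m):
    """Exponent tuples (a_1, ..., a_d) with a_1 + ... + a_d <= m - 1.

    Sorted by total degree, then so that higher powers of x_1 come first,
    which reproduces the ordering of (2.2) for d = 2, m = 3.
    """
    exps = [a for a in product(range(m), repeat=d) if sum(a) <= m - 1]
    return sorted(exps, key=lambda a: (sum(a), tuple(-ai for ai in a)))


def num_polynomials(d, m):
    """M = C(d + m - 1, d), the dimension of the polynomial null space of J_m."""
    return comb(d + m - 1, d)


exps = monomial_exponents(2, 3)
print(num_polynomials(2, 3), exps)
```

For $d = 2$, $m = 3$ the tuple `(2, 0)` stands for $x_1^2$, `(1, 1)` for $x_1 x_2$, and so on.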

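Letting $s = (y_1, \ldots, y_d)$, $t = (x_1, \ldots, x_d)$ and

$$|s - t| = \left(\sum_{i=1}^{d}(x_i - y_i)^2\right)^{1/2},$$

then the fundamental solution of the iterated Laplacian is given by $E_m(s,t)$ defined by

$$E_m(s,t) = E(|s - t|), \qquad (2.3)$$

where

$$E(\tau) = \begin{cases}\theta_m\,\tau^{2m-d}\ln\tau, & d \text{ even},\\ \theta_m\,\tau^{2m-d}, & d \text{ odd},\end{cases}$$

with

$$\theta_m = \begin{cases}\dfrac{(-1)^{d/2+1+m}}{2^{2m-1}\,\pi^{d/2}\,(m-1)!\,(m-d/2)!}, & d \text{ even},\\[2ex] \dfrac{\Gamma(d/2-m)}{2^{2m}\,\pi^{d/2}\,(m-1)!}, & d \text{ odd}.\end{cases}$$

$E_m(s,t)$ has the property $\Delta^m_{(s)}E_m(s,t) = \delta(s - t)$, where the subscript $(s)$ indicates that $\Delta^m$ is applied to $E_m$ considered as a function of $s$ and $\delta$ is the delta function; see Schwartz (1966). We can now state the result.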
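The constants $\theta_m$ and the radial function $E(\tau)$ of (2.3) are straightforward to evaluate. The sketch below simply transcribes the two formulas above; the function names are illustrative assumptions, and the value at $\tau = 0$ is taken as the limit 0, which is valid because $2m - d > 0$.

```python
import math


def theta(m, d):
    """Constant theta_m multiplying the radial function in (2.3)."""
    if 2 * m - d <= 0:
        raise ValueError("need 2m - d > 0")
    if d % 2 == 0:
        sign = (-1) ** (d // 2 + 1 + m)
        denom = (2 ** (2 * m - 1) * math.pi ** (d / 2)
                 * math.factorial(m - 1) * math.factorial(m - d // 2))
        return sign / denom
    # d odd: d/2 - m is a negative half-integer, where gamma is finite.
    return math.gamma(d / 2 - m) / (
        2 ** (2 * m) * math.pi ** (d / 2) * math.factorial(m - 1))


def E(tau, m, d):
    """Radial function E(tau) of (2.3); E(0) is the limit value 0."""
    if tau == 0.0:
        return 0.0
    power = tau ** (2 * m - d)
    if d % 2 == 0:
        return theta(m, d) * power * math.log(tau)
    return theta(m, d) * power


# d = 2, m = 2 is the familiar thin plate case tau^2 ln(tau) / (8 pi).
print(theta(2, 2), E(1.5, 2, 2))
```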

Let $L_1, \ldots, L_N$ be $N$ linearly independent continuous linear functionals and suppose

$$\sum_{\nu=1}^{M} a_\nu L_k\phi_\nu = 0, \quad k = 1, 2, \ldots, N, \qquad (2.4)$$

implies that all the $a_\nu$ are 0.

Then the solution to the problem: find $u_{N,m,\lambda}$ to minimize

$$N^{-1}\sum_{i=1}^{N}\left(\frac{L_i u - z_i}{\sigma_i}\right)^2 + \lambda J_m(u) \qquad (2.5)$$

is unique and has the representation

$$u_{N,m,\lambda}(t) = \sum_{j=1}^{N} c_j\,\xi_j(t) + \sum_{\nu=1}^{M} d_\nu\,\phi_\nu(t), \qquad (2.6)$$

where

$$\xi_j(t) = L_{j(s)}E_m(t,s), \quad j = 1, 2, \ldots, N, \qquad (2.7)$$

and $L_{j(s)}$ means the linear functional $L_j$ applied to what follows considered as a function of $s$. The coefficients $c = (c_1, \ldots, c_N)'$ and $d = (d_1, \ldots, d_M)'$ are determined by

$$(K + N\lambda D_\sigma^2)\,c + Td = z, \qquad (2.8)$$

$$T'c = 0, \qquad (2.9)$$

where $K$ is the $N \times N$ symmetric matrix with $jk$th entry

$$L_{j(s)}L_{k(t)}E_m(s,t), \qquad (2.10)$$

$T$ is the $N \times M$ matrix with $j\nu$th entry

$$L_j\phi_\nu, \qquad (2.11)$$

and $D_\sigma$ is the $N \times N$ diagonal matrix with $jj$th entry $\sigma_j$. An outline of the derivation is given in Appendix A.
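To make (2.6)-(2.11) concrete, the following sketch solves the linear system (2.8)-(2.9) for the simplest setting, $d = 2$, $m = 2$, with point evaluations as the functionals. It is not the algorithm of the Appendices (which is described as nonstandard); it is a direct dense solve, and the function names and test data are illustrative assumptions.

```python
import numpy as np


def tps_kernel(r):
    """E(tau) for d = 2, m = 2: tau^2 ln(tau) / (8 pi), with E(0) = 0."""
    out = np.zeros_like(r, dtype=float)
    pos = r > 0
    out[pos] = r[pos] ** 2 * np.log(r[pos]) / (8.0 * np.pi)
    return out


def fit(pts, z, lam, sigma=None):
    """Solve (2.8)-(2.9) for c and d given points pts (N x 2) and data z."""
    N = len(z)
    sigma = np.ones(N) if sigma is None else np.asarray(sigma, dtype=float)
    r = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    K = tps_kernel(r)                       # entries E_m(t_j, t_k)
    T = np.column_stack([np.ones(N), pts])  # columns: 1, x_1, x_2 (M = 3)
    M = T.shape[1]
    lhs = np.block([[K + N * lam * np.diag(sigma ** 2), T],
                    [T.T, np.zeros((M, M))]])
    rhs = np.concatenate([z, np.zeros(M)])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:N], sol[N:]


def evaluate(pts, c, d, t):
    """Evaluate the representation (2.6) at the rows of t."""
    r = np.linalg.norm(t[:, None, :] - pts[None, :, :], axis=-1)
    return tps_kernel(r) @ c + d[0] + t @ d[1:]


rng = np.random.default_rng(0)
pts = rng.uniform(size=(12, 2))
z_lin = 1.0 + 2.0 * pts[:, 0] - pts[:, 1]  # exactly linear synthetic data
c, d = fit(pts, z_lin, lam=1e-2)
print(np.max(np.abs(c)), d)
```

Because linear functions have $J_2 = 0$, exactly linear data are reproduced for any $\lambda > 0$: the solve should return $c \approx 0$ and $d \approx (1, 2, -1)$ up to round-off.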

Examples 

The simplest example is when the bounded linear functionals are all evaluation functionals $L_i u = u(t_i)$, $i = 1, 2, \ldots, N$. For condition (2.4) to be satisfied it is necessary that the $N$ points $t_1, \ldots, t_N$ do not lie in a hyperplane of dimension $d - 1$ or less. For example, if $d = 2$, then we need


and  the  N points  must  not  fall  on  a straight  line 
Then 


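$$\xi_j(t) = E_m(t, t_j), \qquad (2.12)$$

$$L_{j(s)}L_{k(t)}E_m(s,t) = E_m(t_j, t_k), \qquad (2.13)$$

$$L_j\phi_\nu = \phi_\nu(t_j). \qquad (2.14)$$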

If

$$Lu = \int_{\Omega} K(x_3)\,u(x_1^*, x_2^*, x_3)\,dx_3,$$

then

$$\xi(x_1,x_2,x_3) = \int_{\Omega} K(y_3)\,E_m(x_1^*, x_2^*, y_3;\,x_1, x_2, x_3)\,dy_3,$$

etc. In general, $\xi$ of this form may not be known explicitly, and then a quadrature approximation may be necessary. An appropriate quadrature approximation for a similar problem can be found in Dyn and Wahba (1979). An example of $\xi_j$ when $L_j u$ involves derivatives is given in Appendix B.

We make the important observation that estimates of derivatives of $u$ up to order $l$ may be obtained by differentiating $u_{N,m,\lambda}$ analytically, provided $2m - 2l - d > 0$.

We remark that in the more familiar Hilbert spaces of functions for which it is only assumed that

$$\int\cdots\int u^2(x_1, \ldots, x_d)\,dx_1\cdots dx_d < \infty,$$

the evaluation functionals $L_k u = u(t_k)$ are not continuous linear functionals.

3. The generalized cross-validation (GCV) method for choosing $\lambda$ and $m$

We describe the generalized cross-validation (GCV) method for choosing $\lambda$ and $m$. We emphasize that $\lambda$ and $m$ are the "tuning parameters" of this method (every objective analysis technique has tuning parameters!) and one of the novel features being reported here is the ability to estimate good values of $\lambda$ and $m$ automatically from the data being analyzed. Frequently, this task is performed by trial and error. We remark that GCV also can be used with other methods but we do not pursue this point here.

To describe the GCV method, we first define the "ordinary" cross-validation function $V_m^0(\lambda)$. Let $u^{(k)}_{N,m,\lambda}$ be the minimizer of

$$N^{-1}\sum_{j=1,\;j\ne k}^{N}(L_j u - z_j)^2\,\sigma_j^{-2} + \lambda J_m(u),$$

i.e., the $k$th data point has been left out. Then

$$L_k u^{(k)}_{N,m,\lambda} - z_k \qquad (3.1)$$

is the difference between the $k$th data point and an estimate of the $k$th data point from the remaining data when $m$ and $\lambda$ are used. If $m$ and $\lambda$ are a good choice, the quantities in (3.1) should be small on the average and this can be measured by

$$V_m^0(\lambda) = N^{-1}\sum_{k=1}^{N}\left(L_k u^{(k)}_{N,m,\lambda} - z_k\right)^2\sigma_k^{-2}. \qquad (3.2)$$

The general idea is that one would choose $\lambda$ and $m$ to minimize (3.2). It turns out that (3.2) is very difficult to compute. Furthermore, it is shown in Craven and Wahba (1979, hereafter CW), Golub et al. (1979, hereafter GHW) and Wahba (1977) that from a theoretical point of view it is better to choose $\lambda$ and $m$ to minimize a certain weighted version $V_m(\lambda)$ of $V_m^0(\lambda)$ defined by

$$V_m(\lambda) = N^{-1}\sum_{k=1}^{N}\left[L_k u^{(k)}_{N,m,\lambda} - z_k\right]^2\sigma_k^{-2}\,w_k(m,\lambda), \qquad (3.3)$$

where $w_k(m,\lambda)$ are weights defined by

$$w_k(m,\lambda) = \left[1 - a_{kk}(m,\lambda)\right]^2\Big/\left[1 - N^{-1}\sum_{j=1}^{N}a_{jj}(m,\lambda)\right]^2, \qquad (3.4)$$

where the $a_{kk}(m,\lambda)$ satisfy

$$\frac{\partial}{\partial z_k}L_k u_{N,m,\lambda} = a_{kk}(m,\lambda). \qquad (3.5)$$

If the $a_{kk}$ were all the same the weights would be 1.

We first review the results from CW and GHW on the optimality properties of the GCV estimates $\hat\lambda$ and $\hat m$ which are obtained as the minimizers of (3.3). Then we provide a simplified expression for $V_m(\lambda)$ which is amenable to computation, and in fact is much easier to compute than $V_m^0(\lambda)$.

The optimality properties of $\hat\lambda$ and $\hat m$ are based on assuming that

$$z_j = L_j u + \epsilon_j, \quad j = 1, 2, \ldots, N,$$

where $u$ is the "true" field and $\epsilon_j$ is an error which is assumed to have mean zero and mean-square $\sigma_j^2$.⁴

⁴ In fact, here and elsewhere it is only necessary that the $\sigma_j$ are relatively correct, since one can multiply all the $\sigma_j$ by an arbitrary constant which then gets absorbed in $\lambda$.


We define an error function when $m$ and $\lambda$ are used as

$$R_m(\lambda) = E\,N^{-1}\sum_{j=1}^{N}\left(L_j u - L_j u_{N,m,\lambda}\right)^2\sigma_j^{-2}, \qquad (3.6)$$

where $E$ means expected value. $R_m(\lambda)$ is not computable, of course, since $u$ is not known. However, it is shown in GHW and CW that under rather general circumstances the $\lambda$ and $m$ which minimize $V_m(\lambda)$ are good estimates of the $\lambda$ and $m$ which minimize $R_m(\lambda)$, and $EV_m(\lambda) \approx R_m(\lambda)$ + a constant, for $\lambda$ near the minimizer of $R_m(\lambda)$.

We now give a different, but equivalent, expression for $V_m(\lambda)$ of (3.3) which is suitable for efficient numerical evaluation. First, it can be shown by the same reasoning as in CW, Lemma 3.2, that

$$L_k u^{(k)}_{N,m,\lambda} - z_k = \frac{L_k u_{N,m,\lambda} - z_k}{1 - a_{kk}(m,\lambda)}, \qquad (3.7)$$

and substituting this into (3.3) gives an alternative expression for $V_m(\lambda)$ of (3.3),

$$V_m(\lambda) = N^{-1}\sum_{k=1}^{N}\left(L_k u_{N,m,\lambda} - z_k\right)^2\sigma_k^{-2}\Big/\left(1 - N^{-1}\sum_{j=1}^{N}a_{jj}\right)^2, \qquad (3.8)$$

where $a_{jj} = a_{jj}(m,\lambda)$. Letting $A_m(\lambda)$ be the $N \times N$ matrix defined by

$$\left(L_1 u_{N,m,\lambda}, \ldots, L_N u_{N,m,\lambda}\right)' = A_m(\lambda)\,z,$$

then the expression (3.8) for $V_m(\lambda)$ can be written in the equivalent form

$$V_m(\lambda) = \frac{N^{-1}\left\|D_\sigma^{-1}\left[I - A_m(\lambda)\right]z\right\|^2}{\left[N^{-1}\,\mathrm{Trace}\left(I - A_m(\lambda)\right)\right]^2}, \qquad (3.9)$$

Fig. 1. Location of radiosonde stations and boundary of grid used for evaluation of the analysis. (The station at San Juan, Puerto Rico, is not shown.)


where $D_\sigma$ is the diagonal matrix with $jj$th entry $\sigma_j$, the trace of a matrix is the sum of its diagonal entries, and $\|\cdot\|$ is the Euclidean norm. An explicit formula for $I - A_m(\lambda)$ in terms of the matrices $K$, $T$ and $D_\sigma$ is given in Appendix C. In Appendix C we describe a numerical algorithm for computing $V_m(\lambda)$ and finding the minimizing $\lambda$ for each $m$, as well as computing the coefficients $c$ and $d$. This algorithm was successfully implemented for the special case $d = 2$, $L_i u = u(t_i)$, $\sigma_i^2 = \sigma^2$, $m = 2$, 3, 4, 5 or 6 and $N = 120$, to give the numerical results in the next section.

4. Numerical experiments

We have programmed and tested the method for analyzing data from simulated 500 mb height fields using simulated data at $N = 114$ North American radiosonde station locations. The simulated data were obtained from a mathematical model of 500 mb height fields used by Dr. Thomas Koehler of the Department of Meteorology at the University of Wisconsin that was based on an earlier model developed by Sanders (1971). The location of the 114 stations is given in Fig. 1. The equations generating the field are given in Appendix B. Discussion of the rationale behind the model appears in Koehler's (1979) thesis. Contour maps of the model fields appear below together with contour maps of the analyzed fields determined from the simulated data. Data were simulated by computing the true 500 mb height at station $i$ by calling Koehler's program and adding a simulated measurement error. The simulated measurement error was obtained by calling the pseudo random number generator RAENBR in the University of Wisconsin Academic Computing Center library. This program obtains a pseudo random normally distributed number with mean 0 and standard deviation 1 and multiplies this number by a constant which is given here as the standard deviation of the measurement error. This procedure resulted in a set of 114 simulated measured 500 mb heights which were then used to obtain an analyzed field. This is the simulated data vector $z$. To recapitulate the formulas for obtaining the analyzed field, we go back to Section 2, with $d = 2$, $N = 114$, $L_i u = u(t_i)$, where $t_i = (x_i, y_i)$, the coordinates of the $i$th station. We have considered $m = 2$, 3, 4, 5 and 6. The analyzed field is given by $u_{N,m,\lambda}$ of Eq. (2.6), where $\xi_j$ is defined by (2.12), $\phi_\nu$ is defined by (2.2) for $m = 3$ and analogous formulas for other $m$, $K$ is defined by (2.10) and (2.13), and $T$ is defined by (2.11) and (2.14). For each $m$, $V_m(\lambda)$ is defined by (3.9), where $D_\sigma$ is taken as the identity matrix since all measurement errors are assumed to have the same standard deviation. $V_m(\lambda)$ is computed as in Appendix C but since $D_\sigma$ is the identity matrix, then $\tilde T = T$. The earth was assumed "flat" and latitude and longitude coordinates were treated as $(x, y)$ for the analysis of the field and then converted back to latitude and longitude in the contour maps given below. To minimize round-off errors $x$ and $y$ were rescaled to be roughly of magnitude one in absolute value for the calculations.

Fig. 2. $R_m(\lambda)$ and $V_m(\lambda)$, $m = 5$, Example 1.

Fig. 4a. The model (dashed line) and analyzed (solid line) fields with $\sigma = 5$.

Fig. 4b. As in Fig. 4a except with $\sigma = 10$.

Fig. 4c. As in Fig. 4a except with $\sigma = 15$.
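The recapitulated recipe can be sketched in a few lines for $d = 2$, $m = 2$ and $D_\sigma = I$: form the matrix $A_m(\lambda)$ that maps $z$ to the fitted values, evaluate $V_m(\lambda)$ from (3.9), and minimize over a grid of $\lambda$. This is a brute-force illustration, not the eigenvalue-based algorithm of Appendix C; the synthetic field, noise level, grid of $\lambda$ values and function names are all assumptions made for the example.

```python
import numpy as np


def tps_kernel(r):
    """E(tau) for d = 2, m = 2: tau^2 ln(tau) / (8 pi), with E(0) = 0."""
    out = np.zeros_like(r, dtype=float)
    pos = r > 0
    out[pos] = r[pos] ** 2 * np.log(r[pos]) / (8.0 * np.pi)
    return out


def influence_matrix(pts, lam):
    """A(lam) with (u(t_1), ..., u(t_N))' = A z, taking D_sigma = I."""
    N = len(pts)
    r = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    K = tps_kernel(r)
    T = np.column_stack([np.ones(N), pts])
    M = T.shape[1]
    lhs = np.block([[K + N * lam * np.eye(N), T],
                    [T.T, np.zeros((M, M))]])
    # Solve (2.8)-(2.9) for every unit data vector at once.
    rhs = np.vstack([np.eye(N), np.zeros((M, N))])
    sol = np.linalg.solve(lhs, rhs)
    C, D = sol[:N], sol[N:]
    return K @ C + T @ D


def gcv(pts, z, lam):
    """V(lam) of (3.9) with D_sigma = I."""
    N = len(z)
    A = influence_matrix(pts, lam)
    resid = z - A @ z
    return (resid @ resid / N) / (np.trace(np.eye(N) - A) / N) ** 2


rng = np.random.default_rng(0)
pts = rng.uniform(size=(40, 2))
truth = np.sin(2.0 * np.pi * pts[:, 0]) * pts[:, 1]  # made-up smooth field
z = truth + 0.1 * rng.standard_normal(len(pts))      # made-up noise level

lams = np.logspace(-6, 0, 25)
V = np.array([gcv(pts, z, lam) for lam in lams])
lam_hat = lams[np.argmin(V)]
print(lam_hat)
```

Repeating the grid search for each candidate $m$ (with the corresponding kernel and polynomial basis) and keeping the pair with the smallest $V_m(\hat\lambda)$ would mimic the choice of $m$ described in the experiments.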


In the first series of experiments we considered one field (to be called Example 1) and considered $\sigma = 5$, 10, 15 and 20 m. For each data set (i.e., value of $\sigma$) we let $m = 2$, 3, 4, 5 and 6. Let us first examine the choice of $\lambda$. In the first example discussed here, $\sigma = 10$ and $m = 5$ ($m = 5$ was the "estimated" $m$ for this case, more about that next). Fig. 2 gives a plot of $V_5(\lambda)$ vs $\lambda$ and $R_5(\lambda)$. Here $R_m(\lambda)$ is defined as

$$R_m(\lambda) = N^{-1}\sum_{i=1}^{N}\left[u_{N,m,\lambda}(t_i) - u(t_i)\right]^2,$$

where $u(t_i)$ is the "true", i.e., model 500 mb height field at station $i$. Theoretically, $V_m(\lambda)$ should "track" $R_m(\lambda)$ near the minimum of $R_m(\lambda)$ (see Craven and Wahba, 1979; Golub et al., 1979). In practice $R_m(\lambda)$ is not known, but in this example, which is fairly typical, it can be seen that the minimizer, call it $\hat\lambda$, of $V_m(\lambda)$ is a very good estimate of the minimizer of $R_m(\lambda)$. In fact the "inefficiency" $R_m(\hat\lambda)/\min_\lambda R_m(\lambda) = 1.005$.

Fig. 3 illustrates how $m$ is chosen from the data and how good this choice is. To study variability of the method with $m$ and $\sigma$, the same set of 114 pseudo-random numbers has been used in each of the 20 = 5 × 4 analyses behind Fig. 3. The pseudo-random number for station $i$ was multiplied by $\sigma = 5$, 10, 15 and 20 in turn to get four data sets.

Fig. 3a plots $V_m(\hat\lambda)$ at the minimizing value $\hat\lambda$ for $m = 2$, 3, 4, 5 and 6 for the first data set ($\sigma = 5$). The minimizing value $\hat\lambda$ will be different in each case. According to Fig. 3a the choice of $m = 5$ would be made from the data. For comparison $R_m(\hat\lambda)$ is also given. Figs. 3b, 3c and 3d give the same plots for the other three data sets with $\sigma = 10$, 15 and 20. It is seen that the choice $m = 5$ would be made from the data in each case. In general, $R_m(\hat\lambda)$ is very close to $\min_\lambda R_m(\lambda)$ and these plots suggest that choosing $m$ to minimize $V_m(\hat\lambda)$ will result in a good choice of $m$. However, $R_m(\hat\lambda)$ for $m = 4$ and $m = 6$ is only slightly larger than $R_5(\hat\lambda)$. The two points corresponding to the $m = 5$, $\sigma = 10$ case of Fig. 2 are circled in Fig. 3b. Figs. 4a-4d give the model and analyzed field for $m = 5$ with the estimated $\hat\lambda$ for each $\sigma$ tried. The model field contours (dashed lines) are the same in each figure. The analyzed field contours are solid lines. The contours are labeled in tens of meters.

From the data behind Fig. 3 one can establish that $[R_m(\hat\lambda)]^{1/2}$ is between $0.6\sigma$ and $0.8\sigma$. Thus the measurement noise is being filtered out to give a better estimate, overall, of the station 500 mb height than the measured heights!

In practice, of course, we want the analyzed field to be a good estimate of the true field over a whole region, not just at the points where it is measured. To determine how well this goal is being met, the RMSE (root-mean-square error) of the analyzed field over a 17 × 26 grid covering the region outlined over North America with a solid line in Fig. 1 was computed. This RMSE is defined as follows:

$$\mathrm{RMSE} = \mathrm{RMSE}(m,\hat\lambda) = \left\{\frac{1}{17 \times 26}\sum_{i=1}^{17}\sum_{j=1}^{26}\left[u_{N,m,\hat\lambda}(\theta_i,\phi_j) - u(\theta_i,\phi_j)\right]^2\right\}^{1/2},$$

where $\hat\lambda$ is the estimated $\lambda$ for each $m$. The RMSE is, of course, an overall measure of how well an entire field can be estimated over a region from the 114 data points.
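The grid RMSE defined above is a plain root-mean-square difference over the grid points. A minimal sketch, in which the grid coordinates and the two field functions are placeholders rather than the model field or the region used in the paper:

```python
import numpy as np


def rmse_on_grid(analyzed, truth, xs, ys):
    """Root-mean-square difference of two fields over the grid xs x ys.

    analyzed and truth are callables taking coordinate arrays (X, Y).
    """
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    diff = analyzed(X, Y) - truth(X, Y)
    return float(np.sqrt(np.mean(diff ** 2)))


# Placeholder 17 x 26 grid and fields, for illustration only.
xs = np.linspace(0.0, 1.0, 17)
ys = np.linspace(0.0, 1.0, 26)
true_field = lambda X, Y: X + Y
shifted_field = lambda X, Y: X + Y + 2.0  # constant error of 2 everywhere
print(rmse_on_grid(shifted_field, true_field, xs, ys))
```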

Fig. 5 gives plots of $\mathrm{RMSE}(m,\hat\lambda)$ for the four values of $\sigma$ tried. $\mathrm{RMSE}(m,\hat\lambda)$ is generally greater than $[R_m(\hat\lambda)]^{1/2}$. For comparison $[R_m(\hat\lambda)]^{1/2}$ is also plotted. The excess of $\mathrm{RMSE}(m,\hat\lambda)$ over $[R_m(\hat\lambda)]^{1/2}$ reflects the inability of the method to interpolate between data points.

It can be seen from Fig. 5 that by the RMSE criterion an $m$ somewhat smaller than 5 would give slightly better results in these examples. To what extent this result on a model field carries over to real fields is really a question of how closely the model represents the real world with respect to the feature being tested.

Fig. 6. The model field (dashed line) and two analyzed fields (solid line) with replicated data.

To get a feel for the variability of the analysis with actual variation in the measurement errors, Example 1 above with $\sigma = 10$ was replicated beginning with a new set of random numbers. $V_m(\lambda)$ was computed from the data and $m = 5$ was again chosen from the data.

The estimated value $\hat\lambda$ in the second replicate was very close to $\hat\lambda$ in the first replicate. (Remember that the "model" field is identical in both cases.) However, while the RMSE was 13.69 in the first replicate, it was 17.13 in this one. The model and the two analyzed fields for this case appear in Fig. 6.

Fig. 7. Four examples with $\sigma = 10$: (a) $\mathrm{ALON}_0 = 105$, (b) $\mathrm{ALON}_0 = 100$, (c) $\mathrm{ALON}_0 = 95$, (d) $\mathrm{ALON}_0 = 90$.


Finally, we look at variations as the field varied. Three other fields, in addition to the first example, were generated by moving the field from west to east. The four fields are characterized by the parameter $\mathrm{ALON}_0$ in the model. In Example 1, $\mathrm{ALON}_0 = 95$. The other three cases are 90, 100 and 105. The second replicate with $\mathrm{ALON}_0 = 95$ is used in this series, and the same set of 114 original random numbers used in the second replicate is used in the other three examples here. A set of data with $\sigma = 10$ was generated for each of these three new fields. The estimated values of $m$ were

ALON0    m
105      4
100      4
95       5 (already given)
90       6


Fig. 7 gives a plot of the true and analyzed field in each of these four cases. The RMSE values were

ALON0    RMSE(m, λ̂)
105      8.40
100      10.80
95       17.13
90       13.08

We have used $\sigma = 10$ as a typical value here because the results in Thiebaux (1977, Table 1, first row, fifth column) suggest that the root-mean-square measurement error at Topeka is less than 10 m (assuming zero mean measurement errors). Note that an earlier study (Air Weather Service, 1955) has estimated $\sigma$ at ~20 m.

The question of whether in practice $m$ and $\lambda$ can be effectively chosen once and for all or should be estimated dynamically from the data has not been completely addressed here. This question can be addressed with "model" data only to the extent that the model represents the real world with respect to the phenomena being studied. Furthermore, if the criterion is minimum RMSE then this question cannot be answered with real data unless it is available on a fine grid. Predictive ability on the measurement grid can be studied in experiments philosophically like those of Thiebaux (1977), who omitted data from Topeka and then examined how well the Topeka data could be estimated from other data. We are presently doing this with both the isotropic and anisotropic methods and preliminary results are very promising.

A few preliminary experiments we have carried out with a limited set of examples have resulted in effectively similar values of $\hat\lambda$ for fixed $m$. If $m$ and $\lambda$ can be fixed, then the repetitive estimation of $u_{N,m,\lambda}$ from data from a given set of stations becomes very inexpensive.

Ultimately, whether or not $m$ and $\lambda$ should be estimated from the data or can safely be "fixed" at some prior value will have to be determined with respect to the ultimate use to which the analyzed field is put (e.g., if it is used in a forecast model, then one should determine whether dynamic estimation of $\lambda$ and $m$ is cost effective in terms of better forecasts).
APPENDIX A


Outline  of  the  Derivation  of  Eq.  (2.6) 

The solution to the minimization problem of (2.5)
will be found by the use of geometry in Hilbert space.
By use of classical methods, it is possible to characterize
the solution as the solution to a partial differential
equation with delta functions and derivatives
of delta functions on the right-hand side, but
the present approach leads simply to algorithms
which do not require the numerical solution to a
partial differential equation. The reader not familiar
with Hilbert spaces may find that Akhiezer and Glazman
(1961, pp. 1-21 and 30-3**) provide the necessary
definitions of Hilbert space norm and inner product.
The Hilbert spaces we will use all possess a reproducing
kernel which is used to construct the solution;
these kernels will be described below. We wish
to minimize

$$ N^{-1} \sum_{i=1}^{N} (L_i u - z_i)^2 / \sigma_i^2 + \lambda J_m(u) \tag{2.5} $$

in an appropriate Hilbert space¹ $X$ of functions for
which $J_m(u)$ is finite. We first define a suitable inner
product on $X$. Let $s_1, s_2, \ldots, s_M$ be a fixed set of
$M$ points in Euclidean $d$-space with the property that

$$ \sum_{\nu=1}^{M} a_\nu \phi_\nu(s) = 0, \quad \text{for } s = s_1, \ldots, s_M $$

implies that all $a_\nu$ are 0. The particular choice of
these points is unimportant as they will cancel out
later. An inner product $\langle u, v \rangle$ is defined on $X$ by

$$ \langle u, v \rangle = \sum_{j=1}^{M} u(s_j) v(s_j) + J_m(u, v), \tag{A1} $$

where $J_m(u, v)$ denotes the symmetric bilinear form associated
with $J_m$, so that $J_m(u, u) = J_m(u)$. It follows from (A1) that
the norm $\|u\|$ in $X$ is given by

$$ \|u\|^2 = \sum_{j=1}^{M} u^2(s_j) + J_m(u). \tag{A2} $$

¹ The rigorous definition of $X$ is: $X$ is the vector space of all
the Schwartz distributions for which all the partial derivatives
in the distributional sense of total order $m$ are square integrable
[see Meinguet (1979), Eq. (4)].

A (real) continuous linear functional $L$ defined on
functions $u$ in $X$ is a functional which assigns a real
number to each $u$ with the property

$$ L(\alpha u_1 + \beta u_2) = \alpha L u_1 + \beta L u_2 $$

for any $u_1$ and $u_2$, and furthermore there exists a
constant $C$ so that

$$ |Lu| \le C \|u\|, \quad \text{all } u \in X. \tag{A3} $$

For the familiar space $L_2$ of functions with

$$ \|u\|^2 = \int \cdots \int u^2(x_1, \ldots, x_d) \, dx_1 \cdots dx_d, $$

$Lu = u(t^*)$ is not a continuous linear functional because
(A3) cannot be satisfied. However, in all the
spaces $X$ that we will consider, $Lu = u(t^*)$ will be
a continuous linear functional. By the Riesz representation
theorem (Akhiezer and Glazman, 1961,
p. 33),* if $L_i$ is a continuous linear functional on
functions in a Hilbert space $X$, then there is a function
$\eta_i$ in $X$, called the representer of $L_i$, such that

$$ L_i u = \langle \eta_i, u \rangle. $$

Suppose these $\eta_i$ were given. Then our minimization
problem is as follows: Find $u$ in $X$ to minimize

$$ N^{-1} \sum_{i=1}^{N} (\langle \eta_i, u \rangle - z_i)^2 / \sigma_i^2 + \lambda J_m(u). \tag{A4} $$

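We look at this problem from a geometric point of
view. Any $u$ in the Hilbert space $X$ can be written
as a linear combination of $\eta_1, \ldots, \eta_N$ and $\phi_1, \ldots, \phi_M$
plus some function $\rho$ which is perpendicular to each
$\eta_i$ and $\phi_\nu$, that is,

$$ u = \sum_{i=1}^{N} c_i \eta_i + \sum_{\nu=1}^{M} d_\nu \phi_\nu + \rho \tag{A5} $$

for some coefficients $c = (c_1, \ldots, c_N)'$, $d = (d_1, \ldots, d_M)'$, where

$$ \langle \eta_i, \rho \rangle = 0, \quad i = 1, 2, \ldots, N; \qquad \langle \phi_\nu, \rho \rangle = 0, \quad \nu = 1, 2, \ldots, M. \tag{A6} $$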

By substituting (A5) into (A4) and using (A6) repeatedly,
one can show that for $u$ of the form (A5)
to minimize (A4), it is necessary that $\rho = 0$. By
using Lemma 5.1 in Kimeldorf and Wahba (1971),
it can also be established [assuming (2.4)] that the
coefficients $c_i$ must satisfy

$$ \Big\langle \phi_\nu, \sum_{i=1}^{N} c_i \eta_i \Big\rangle = 0, \quad \nu = 1, 2, \ldots, M, \tag{A7} $$

which is equivalent to (2.9), i.e.,

$$ T'c = 0, \tag{2.9} $$

since $\langle \eta_i, \phi_\nu \rangle = L_i \phi_\nu$. It remains to find the $\eta_i$ and the
coefficients $c = (c_1, \ldots, c_N)'$ and $d = (d_1, \ldots, d_M)'$.

* Akhiezer and Glazman use "linear functional" for what we
are calling "continuous linear functional".

To find the $\eta_i$ we use the theory of reproducing
kernels. [For more details concerning what follows,
see Aronszajn (1950) and Kimeldorf and Wahba
(1971).] A Hilbert space $X$ is said to possess a reproducing
kernel (rk) if, for each $t^*$ in $R^d$, the functional
$Lu = u(t^*)$ is a continuous linear functional.
Then there exists a representer $q_{t^*}$ in $X$ such that

$$ Lu = u(t^*) = \langle q_{t^*}, u \rangle. $$

We define the function $Q(s, t)$ of two (vector) variables
$s$ and $t$ by

$$ Q(s, t) = \langle q_s, q_t \rangle, $$

where $Q$ is called the reproducing kernel for $X$. The
basic property of the reproducing kernel is that given
$Q$, one can find the representers of any continuous
linear functionals. The $\eta_i$ are given by

$$ \eta_i(t) = L_{i(s)} Q(s, t), \tag{A8} $$

and, furthermore,

$$ \langle \eta_i, \eta_j \rangle = L_{i(s)} L_{j(t)} Q(s, t), \tag{A9} $$

where, as before, the subscript $(s)$ indicates that the
functional $L_i$ is to be applied to what follows considered
as a function of $s$.
Using results in Duchon (1976a, 1976b) and
Meinguet (1979), it is possible to deduce that the
reproducing kernel $Q(s, t)$ for $X$ with the inner
product given by (A1) is given by

$$ Q(s, t) = K(s, t) + P(s, t), \tag{A10} $$

where

$$ K(s, t) = E_m(s, t) - \sum_{\nu=1}^{M} p_\nu(t) E_m(s_\nu, s) - \sum_{\mu=1}^{M} p_\mu(s) E_m(t, s_\mu) + \sum_{\mu,\nu=1}^{M} p_\mu(s) p_\nu(t) E_m(s_\mu, s_\nu), $$

$$ P(s, t) = \sum_{\nu=1}^{M} p_\nu(s) p_\nu(t). $$

$E_m(s, t)$ is as defined in (2.3), and $p_1, \ldots, p_M$ are
the $M$ polynomials of total degree less than $m$ satisfying
$p_\nu(s_\mu) = 1$ if $\mu = \nu$ and equal to zero otherwise.
To verify that $Q(s, t)$ is the reproducing kernel
for $X$ it is sufficient to check that $\langle q_t, u \rangle = u(t)$,
where $q_t(s) = Q(s, t)$, and that $q_t$ is in $X$. This can be
done using Meinguet (1979).
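The structure of (A10) can be illustrated numerically. The following is a minimal sketch, not the authors' code, for the case d = 2, m = 2 (so M = 3). It assumes that $E_m$ in (2.3) has the thin plate form $|s - t|^2 \ln|s - t| / (8\pi)$; the function names `E`, `monomials`, `make_p` and `Q`, and the three anchor points, are hypothetical. Substituting $s = s_\mu$ into (A10) makes $K(s_\mu, t)$ vanish, so that $Q(s_\mu, t) = p_\mu(t)$; the sketch prints both sides of this identity together with a symmetry check.

```python
import numpy as np

def E(s, t):
    """Assumed form of E_m in (2.3) for d = 2, m = 2: |s-t|^2 ln|s-t| / (8*pi)."""
    r2 = float(np.sum((np.asarray(s, float) - np.asarray(t, float)) ** 2))
    return 0.0 if r2 == 0.0 else 0.5 * r2 * np.log(r2) / (8.0 * np.pi)

def monomials(s):
    """Basis 1, x1, x2 of the M = 3 polynomials of total degree less than m = 2."""
    return np.array([1.0, s[0], s[1]])

def make_p(anchors):
    """Return p(s) = (p_1(s), ..., p_M(s)) with p_nu(s_mu) = 1 if mu == nu, else 0."""
    V = np.array([monomials(a) for a in anchors])   # V[mu, k] = k-th monomial at s_mu
    coef = np.linalg.inv(V)                          # column nu: coefficients of p_nu
    return lambda s: monomials(s) @ coef

def Q(s, t, anchors, p):
    """Reproducing kernel Q = K + P as written in (A10)."""
    ps, pt = p(s), p(t)
    E_as = np.array([E(a, s) for a in anchors])      # E_m(s_nu, s)
    E_ta = np.array([E(t, a) for a in anchors])      # E_m(t, s_mu)
    E_aa = np.array([[E(a, b) for b in anchors] for a in anchors])
    K = E(s, t) - pt @ E_as - ps @ E_ta + ps @ E_aa @ pt
    P = ps @ pt
    return K + P

# Hypothetical, non-collinear anchor points s_1, s_2, s_3.
anchors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
p = make_p(anchors)
s, t = np.array([0.3, 0.7]), np.array([0.9, 0.2])
print(Q(s, t, anchors, p), Q(t, s, anchors, p))      # Q is symmetric in s and t
print(Q(anchors[1], t, anchors, p), p(t)[1])         # Q(s_mu, t) equals p_mu(t)
```

The identity $Q(s_\mu, t) = p_\mu(t)$ holds for any symmetric $E_m$, so the check does not depend on the assumed constant in (2.3).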

Now, letting $\xi_i(t)$ be as in (2.7),

$$ \xi_i(t) = L_{i(s)} E_m(s, t), \tag{2.7} $$

and recalling that

$$ \eta_i(t) = L_{i(s)} Q(s, t), $$

it can be verified that

$$ \sum_{i=1}^{N} c_i \eta_i(t) = \sum_{i=1}^{N} c_i \xi_i(t) - \sum_{i=1}^{N} \sum_{\nu=1}^{M} c_i \xi_i(s_\nu) p_\nu(t). \tag{A11} $$

Thus, since the double sum on the right is a polynomial,
this establishes that the minimizer of (2.5)
has the form

$$ u_{N,m,\lambda}(t) = \sum_{i=1}^{N} c_i \xi_i(t) + \sum_{\nu=1}^{M} d_\nu \phi_\nu(t). \tag{2.6} $$

The coefficients $c$ and $d$ are obtained as follows.
Since $\sum_i c_i \eta_i$, $\sum_i c_i \xi_i$ and $u_{N,m,\lambda}$ differ by polynomials,

$$ J_m\Big(\sum_{i=1}^{N} c_i \eta_i\Big) = J_m\Big(\sum_{i=1}^{N} c_i \xi_i\Big) = J_m(u_{N,m,\lambda}). $$

By (A11), $\sum_{i=1}^{N} c_i \eta_i(s_\nu) = 0$, $\nu = 1, 2, \ldots, M$, so
by use of (A2) and (A9)

$$ J_m\Big(\sum_{i=1}^{N} c_i \eta_i\Big) = \Big\| \sum_{i=1}^{N} c_i \eta_i \Big\|^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j L_{i(s)} L_{j(t)} Q(s, t). \tag{A12} $$

Using (A7) and (A10) it can be shown that the right-hand
side of (A12) is equal to

$$ \sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j L_{i(s)} L_{j(t)} E_m(s, t) = c'Kc. $$

Using

$$ (L_1 u_{N,m,\lambda}, \ldots, L_N u_{N,m,\lambda})' = Kc + Td, $$

then Eq. (A4) is equal to

$$ N^{-1} (Kc + Td - z)' D_\sigma^{-2} (Kc + Td - z) + \lambda c'Kc. \tag{A13} $$

Minimization of this expression with respect to $c$
and $d$ gives the desired equations for $c$ and $d$, i.e.,

$$ (K + N\lambda D_\sigma^2) c + Td = z, \tag{2.8} $$

$$ T'c = 0. \tag{2.9} $$

We close this Appendix with the observation that
one can also enforce strong constraints by the same
methods. Suppose one wishes to minimize

$$ N^{-1} \sum_{k=1}^{N_1} (L_k u - z_k)^2 \sigma_k^{-2} + \lambda J_m(u), \tag{A14} $$

subject to

$$ L_j u = z_j, \quad j = N_1 + 1, \ldots, N. \tag{A15} $$

The minimizer of (A14) subject to (A15) is obtained
by setting $\sigma_j = 0$, $j = N_1 + 1, \ldots, N$ in (2.8).
However, the computational procedure given in
Appendices C and D is not suitable for this case
since the procedure involves division by $\sigma_j^2$. For a
computational procedure for this case see Wahba
(19??b, Sec. 6).


APPENDIX B

Example of the Calculation of K and T
for $L_i$ Involving Differentiation

Let $d = 2$, $m = 3$,

$$ L_j u = \frac{\partial u}{\partial x_1} \Big|_{t_j}, \qquad L_k u = \frac{\partial u}{\partial x_2} \Big|_{t_k}; $$

then

$$ \xi_j(x_1, x_2) = \frac{\partial}{\partial y_1} E_3(y_1, y_2; x_1, x_2) \Big|_{(y_1, y_2) = t_j} $$

$$ = \frac{\partial}{\partial y_1} \frac{\theta_m}{2} [(y_1 - x_1)^2 + (y_2 - x_2)^2]^2 \ln[(y_1 - x_1)^2 + (y_2 - x_2)^2] $$

$$ = \theta_m \{ 2 [(y_1 - x_1)^2 + (y_2 - x_2)^2] (y_1 - x_1) \ln[(y_1 - x_1)^2 + (y_2 - x_2)^2] + [(y_1 - x_1)^2 + (y_2 - x_2)^2] (y_1 - x_1) \}, $$

$$ L_j \xi_k = \frac{\theta_m}{2} \frac{\partial^2}{\partial y_1 \partial x_2} [(y_1 - x_1)^2 + (y_2 - x_2)^2]^2 \ln[(y_1 - x_1)^2 + (y_2 - x_2)^2] \Big|_{(y_1, y_2) = t_j, \; (x_1, x_2) = t_k}, $$

etc.


APPENDIX C

Calculation of $V_m(\lambda)$ and Its Minimizer,
and of c and d

Calculation of $c$, $d$ and $V_m(\lambda)$ are based on formulas
(C1)-(C3) below. These formulas are derived in
Appendix D.

$$ c = R (R'KR + N\lambda R'D_\sigma^2 R)^{-1} R'z, \tag{C1} $$

$$ d = (T'D_\sigma^{-2} T)^{-1} T'D_\sigma^{-2} (z - Kc), \tag{C2} $$

($c$ and $d$ have originally been given in (2.8) and (2.9))

$$ I - A_m(\lambda) = N\lambda D_\sigma^2 R (R'KR + N\lambda R'D_\sigma^2 R)^{-1} R', \tag{C3} $$

where $R$ is any $N \times N - M$ dimensional matrix of
rank $N - M$ satisfying $R'T = 0$.
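As a numerical illustration of (C1) and (C2), the following minimal sketch computes $c$ and $d$ and prints the largest residuals in (2.8) and (2.9). It is not the authors' implementation. It assumes d = 2, m = 2, evaluation functionals $L_i u = u(t_i)$, and the thin plate form $|r|^2 \ln|r| / (8\pi)$ for $E_m$ in (2.3); the synthetic data and the names `tps_matrix` and `solve_c_d` are hypothetical. The matrix $R$ is taken from a full QR factorization of $T$, which is one of many admissible choices with $R'T = 0$.

```python
import numpy as np

def tps_matrix(pts):
    """K[i, j] = E_m(t_i, t_j), assumed form |r|^2 ln|r| / (8*pi) for d = 2, m = 2."""
    r2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        K = np.where(r2 > 0.0, 0.5 * r2 * np.log(r2), 0.0)
    return K / (8.0 * np.pi)

def solve_c_d(K, T, z, sigma, lam):
    """Coefficients c and d from (C1) and (C2)."""
    N, M = T.shape
    Qfull, _ = np.linalg.qr(T, mode="complete")
    R = Qfull[:, M:]                                  # N x (N-M), rank N-M, R'T = 0
    D2 = np.diag(sigma ** 2)                          # D_sigma^2
    A = R.T @ K @ R + N * lam * (R.T @ D2 @ R)
    c = R @ np.linalg.solve(A, R.T @ z)               # (C1)
    TtD = T.T / sigma ** 2                            # T' D_sigma^{-2}
    d = np.linalg.solve(TtD @ T, TtD @ (z - K @ c))   # (C2)
    return c, d

# Hypothetical synthetic data: N scattered points with unequal error levels.
rng = np.random.default_rng(0)
N, lam = 25, 1e-3
pts = rng.uniform(size=(N, 2))
sigma = rng.uniform(0.5, 1.5, size=N)
z = np.sin(3 * pts[:, 0]) + pts[:, 1] ** 2 + 0.05 * sigma * rng.standard_normal(N)

K = tps_matrix(pts)
T = np.column_stack([np.ones(N), pts])                # phi_1 = 1, phi_2 = x1, phi_3 = x2
c, d = solve_c_d(K, T, z, sigma, lam)

res_28 = (K + N * lam * np.diag(sigma ** 2)) @ c + T @ d - z   # residual of (2.8)
res_29 = T.T @ c                                               # residual of (2.9)
print(np.max(np.abs(res_28)), np.max(np.abs(res_29)))
```

Both printed residuals should be at the level of rounding error, since (C1) and (C2) are algebraically equivalent to (2.8) and (2.9).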
It is shown in Appendix D that the $N - M \times N - M$
dimensional matrix $B$ defined by $B = R'KR$ is
always strictly positive definite (although $K$ may not
be). This allows some of the calculations below to
proceed.

We now discuss a computational procedure which
we have successfully implemented for the special
case $d = 2$, $L_i u = u(t_i)$, $\sigma_i \equiv \sigma$, $m = 2, 3, 4, 5$
or 6, and $N \lesssim 120$.

$R$ can always be chosen so that $R'D_\sigma^2 R = I_{N-M}$,
where $I_{N-M}$ is the $N - M$ dimensional identity
matrix. This is done numerically as follows. Let
$\tilde T = D_\sigma^{-1} T$ and form the matrix $C = I - \tilde T (\tilde T'\tilde T)^{-1} \tilde T'$.
This symmetric non-negative definite matrix is a projection
matrix of rank $N - M$ satisfying $\tilde T'C = 0_{M \times N}$,
and so it has $N - M$ eigenvalues equal to 1 and $M$
eigenvalues equal to 0. The $N - M$ eigenvectors
$\tilde r_1, \ldots, \tilde r_{N-M}$, say, corresponding to the ones
have the property

$$ \tilde T' \tilde r_j = 0, \quad j = 1, 2, \ldots, N - M, $$

and the property that the $N \times N - M$ dimensional
matrix $\tilde R$ with columns $\tilde r_1, \ldots, \tilde r_{N-M}$ satisfies
$\tilde R'\tilde R = I$. The eigenvectors corresponding to the ones are
not individually uniquely defined; of course, any set
will do. Let $r_j = D_\sigma^{-1} \tilde r_j$ and $R = D_\sigma^{-1} \tilde R$. Then
$T'r_j = \tilde T'\tilde r_j = 0$, $j = 1, 2, \ldots, N - M$, and
$R'D_\sigma^2 R = \tilde R'\tilde R = I$. Thus $R$ is the desired matrix. We
successfully used EISPACK (Smith et al., 1976) in
double precision to deliver the $\{\tilde r_j\}$ given $C$, for $N$
up to ~120. Once $R$ is determined, let the eigenvalue
decomposition of $B = R'KR$ be

$$ B = U D_b U', $$

where $U$ is orthogonal and $D_b$ is the diagonal matrix
with diagonal entries the eigenvalues $b_i$ of $B$,
$i = 1, 2, \ldots, N - M$. $U$ and $D_b$ are again obtained
by EISPACK. The $b_i$ are theoretically all positive.
Then $c$ is readily computed from the identity

$$ c = R U (D_b + N\lambda I)^{-1} U'R'z $$

and $d$ is computed from

$$ d = (\tilde T'\tilde T)^{-1} \tilde T' D_\sigma^{-1} (z - Kc). $$

By (C3) we obtain

$$ D_\sigma^{-1} (I - A_m(\lambda)) z = N\lambda D_\sigma R U (D_b + N\lambda I)^{-1} w $$

and using (3.9) we have

$$ V_m(\lambda) = \frac{N^{-1} (N\lambda)^2 \| D_\sigma R U (D_b + N\lambda I)^{-1} w \|^2}{N^{-2} (N\lambda)^2 [\mathrm{Trace}\, R'D_\sigma^2 R (B + N\lambda R'D_\sigma^2 R)^{-1}]^2} = \frac{N^{-1} \sum_{i=1}^{N-M} w_i^2 / (b_i + N\lambda)^2}{\big[ N^{-1} \sum_{i=1}^{N-M} 1 / (b_i + N\lambda) \big]^2}, $$

where $w = (w_1, \ldots, w_{N-M})' = U'R'z$. For fixed $m$,
given the $w_i$ and the $b_i$, it is not hard to find the
$\lambda$ minimizing the right hand side of this expression
by global search. It is convenient to work in units
of $\log \lambda$.


APPENDIX D

Derivation of (C1), (C2) and (C3)

We obtain (C1) and (C2) from (2.8) and (2.9).

$$ c = R [R'KR + N\lambda R'D_\sigma^2 R]^{-1} R'z, \tag{C1} $$

$$ d = (T'D_\sigma^{-2} T)^{-1} T'D_\sigma^{-2} (z - Kc), \tag{C2} $$

$$ (K + N\lambda D_\sigma^2) c + Td = z, \tag{2.8} $$

$$ T'c = 0. \tag{2.9} $$

Here $R$ is any $N \times N - M$ matrix of rank $N - M$
satisfying $R'T = 0$. Since $T'c = 0$ there exists a
unique $N - M$ vector $\gamma$, say, with

$$ c = R\gamma. \tag{D1} $$

Multiplying the left side of (2.8) by $R'$ and substituting
in (D1) gives

$$ R'(K + N\lambda D_\sigma^2) R \gamma = R'z, \qquad \gamma = [R'(K + N\lambda D_\sigma^2) R]^{-1} R'z, \tag{D2} $$

and multiplying the left side of (D2) by $R$ gives (C1).
To get (C2) we multiply the left side of (2.8) by
$T'D_\sigma^{-2}$ to get

$$ T'D_\sigma^{-2} Kc + T'D_\sigma^{-2} Td = T'D_\sigma^{-2} z. \tag{D3} $$

Finally we multiply the left side of (D3) by
$(T'D_\sigma^{-2} T)^{-1}$ to obtain (C2).

To obtain (C3)

$$ I - A_m(\lambda) = N\lambda D_\sigma^2 R (R'KR + N\lambda R'D_\sigma^2 R)^{-1} R', \tag{C3} $$

it is necessary to know that

$$ L_i \xi_j = L_{i(t)} L_{j(s)} E_m(t, s). $$

This is not hard to check from the definitions. Then
one has

$$ L_k u_{N,m,\lambda} = \sum_{j=1}^{N} c_j L_k \xi_j + \sum_{\nu=1}^{M} d_\nu L_k \phi_\nu, \quad k = 1, 2, \ldots, N, $$

or

$$ (L_1 u_{N,m,\lambda}, \ldots, L_N u_{N,m,\lambda})' = Kc + Td, $$

and by the definition of $A_m(\lambda)$, we have

$$ Kc + Td = A_m(\lambda) z. $$
Thus,

$$ (I - A_m(\lambda)) z = z - Kc - Td. \tag{D4} $$

But from (2.8)

$$ z - Kc - Td = N\lambda D_\sigma^2 c. \tag{D5} $$

Substituting (C1) into (D5) and the result into (D4)
gives (C3).
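The computational procedure of Appendix C, which rests on (C1)-(C3), can be sketched as follows. This is a hedged illustration rather than the original EISPACK-based code: it uses NumPy's symmetric eigensolver, assumes d = 2, m = 2, evaluation functionals, and the thin plate form $|r|^2 \ln|r| / (8\pi)$ for $E_m$ in (2.3). The synthetic data and the names `tps_matrix`, `gcv_setup`, `V` and `coefficients` are hypothetical.

```python
import numpy as np

def tps_matrix(pts):
    """K[i, j] = E_m(t_i, t_j), assumed form |r|^2 ln|r| / (8*pi) for d = 2, m = 2."""
    r2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        K = np.where(r2 > 0.0, 0.5 * r2 * np.log(r2), 0.0)
    return K / (8.0 * np.pi)

def gcv_setup(K, T, z, sigma):
    """One-time decomposition: R with R' D^2 R = I, then B = R'KR = U D_b U'."""
    N, M = T.shape
    Tt = T / sigma[:, None]                                  # T~ = D_sigma^{-1} T
    C = np.eye(N) - Tt @ np.linalg.solve(Tt.T @ Tt, Tt.T)    # projection of rank N-M
    _, vecs = np.linalg.eigh(C)                              # ascending: M zeros, then ones
    Rt = vecs[:, M:]                                         # R~: eigenvectors for eigenvalue 1
    R = Rt / sigma[:, None]                                  # R = D_sigma^{-1} R~
    b, U = np.linalg.eigh(R.T @ K @ R)                       # eigenvalues b_i of B
    w = U.T @ (R.T @ z)                                      # w = U'R'z
    return R, U, b, w

def V(lam, N, b, w):
    """V_m(lambda) expressed through the w_i and b_i."""
    num = np.sum(w ** 2 / (b + N * lam) ** 2) / N
    den = (np.sum(1.0 / (b + N * lam)) / N) ** 2
    return num / den

def coefficients(lam, K, T, z, sigma, R, U, b, w):
    """c = R U (D_b + N lam I)^{-1} w and d = (T~'T~)^{-1} T~' D^{-1} (z - Kc)."""
    N = len(z)
    c = R @ (U @ (w / (b + N * lam)))
    Tt = T / sigma[:, None]
    d = np.linalg.solve(Tt.T @ Tt, Tt.T @ ((z - K @ c) / sigma))
    return c, d

# Hypothetical synthetic data.
rng = np.random.default_rng(1)
N = 25
pts = rng.uniform(size=(N, 2))
sigma = rng.uniform(0.5, 1.5, size=N)
z = np.sin(3 * pts[:, 0]) + pts[:, 1] ** 2 + 0.05 * sigma * rng.standard_normal(N)
K = tps_matrix(pts)
T = np.column_stack([np.ones(N), pts])

R, U, b, w = gcv_setup(K, T, z, sigma)
log_lams = np.linspace(-10.0, 2.0, 121)                      # global search in log10(lambda)
values = np.array([V(10.0 ** ll, N, b, w) for ll in log_lams])
lam_hat = 10.0 ** log_lams[np.argmin(values)]
c, d = coefficients(lam_hat, K, T, z, sigma, R, U, b, w)
print(lam_hat, b.min())
```

Because the eigendecomposition is done once, each evaluation of $V_m(\lambda)$ costs only $O(N - M)$ operations, which is what makes a global search over $\log \lambda$ practical.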

We now give a brief argument why the $N - M \times N - M$
matrix $B = R'KR$ is always strictly positive
definite. Let $K_0$ and $R_0$ be the special cases of
$K$ and $R$ when $L_i u = u(t_i)$. Duchon (1976b) has
shown in this case that $R_0'K_0R_0$ is always strictly
positive definite for any $N \ge M$. By using the fact
that all continuous linear functionals in a reproducing
kernel Hilbert space are limits of sums of evaluation
functionals, one can show the positive definiteness
in general [see Dyn and Wahba (1979) for
more details].


APPENDIX  E 

500  mb  Height  Model 

As mentioned earlier, the height field used in the
numerical experiments is the same as that used by
T. Koehler. Koehler adopted the model of Sanders
to represent meteorological phenomena of interest
(in particular, we used pressure surfaces) over an
area the size of North America. In his model the
height $z$ of any pressure surface $p$ at longitude $\theta$ and
latitude $\phi$ is defined as follows:

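$$ z(\theta, \phi, p) = \hat z \cos[(2\pi/L_\theta)(\theta_0 - \theta + \Delta\theta)] G'(\phi) + \bar z $$

$$ + [\bar T_0(1000)/\gamma][1 - (p/1000)^{R\gamma/g}] $$

$$ - (R/g) \{ \ln(1000/p) - (\alpha/2)[\ln(1000/p)]^2 \} $$

$$ \times \{ (a r/\sin\phi_0)(\cos\phi_0 - \cos\phi) $$

$$ + \hat T \cos[(2\pi/L_\theta)(\theta_0 - \theta)] G(\phi) \}, $$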

where 

$\bar T_0(1000)$ = 278 K
$\hat T$ = 10 K
z = 150 m
z = 90 m
$R\gamma/g$ = 0.0953

$a$ = 0.9 × 10^{-5} K m^{-1}
$r$ = 6371 km
$\Delta\theta$ = 9°

$L_\theta$ = 30°

$\phi_0$ = 45°
$\alpha$ = 0.621

$R$ = 287.04 m^2 s^{-2} K^{-1}
$g$ = 9.8 m s^{-2}
$p$ = 500 mb
$\theta_0$ = ALON$_0$


Also 


G(</>)  = - (</>  - </>«)j  -t  i — (A  - d»„)j 

+ i/J  — (4>  - 4>,<)  j + c. 


with 


and 


b = - 1/60.  J = -'•0/60, 
t = 11/60,  c = I, 


G'W  = j 


sim/.'  0G(<#>')  . „ 

dtp  . 

e.  sind>o  0<t>' 


In the numerical experiments the parameter
ALON$_0$ was varied taking the values 105, 100, 95
and 90. This parameter determined the longitude
at which the wave "begins". Hence, by decreasing
ALON$_0$ the wave "moves" from west to east. For
the physical interpretation of the other constants
and functional form of the model the reader is referred
to Koehler (1979).